Computer Vision and Pattern Recognition 150
☆ AniDoc: Animation Creation Made Easier
Yihao Meng, Hao Ouyang, Hanlin Wang, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Zhiheng Liu, Yujun Shen, Huamin Qu
The production of 2D animation follows an industry-standard workflow,
encompassing four essential stages: character design, keyframe animation,
in-betweening, and coloring. Our research focuses on reducing the labor costs
in the above process by harnessing the potential of increasingly powerful
generative AI. Using video diffusion models as the foundation, AniDoc emerges
as a video line art colorization tool, which automatically converts sketch
sequences into colored animations following the reference character
specification. Our model exploits correspondence matching as an explicit
guidance, yielding strong robustness to the variations (e.g., posture) between
the reference character and each line art frame. In addition, our model could
even automate the in-betweening process, such that users can easily create a
temporally consistent animation by simply providing a character image as well
as the start and end sketches. Our code is available at:
https://yihao-meng.github.io/AniDoc_demo.
comment: Project page and code: https://yihao-meng.github.io/AniDoc_demo
☆ Learning from Massive Human Videos for Universal Humanoid Pose Control
Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, Yue Wang
Scalable learning of humanoid robots is crucial for their deployment in
real-world applications. While traditional approaches primarily rely on
reinforcement learning or teleoperation to achieve whole-body control, they are
often limited by the diversity of simulated environments and the high costs of
demonstration collection. In contrast, human videos are ubiquitous and present
an untapped source of semantic and motion information that could significantly
enhance the generalization capabilities of humanoid robots. This paper
introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot
poses with corresponding text-based motion descriptions, designed to leverage
this abundant data. Humanoid-X is curated through a comprehensive pipeline:
data mining from the Internet, video caption generation, motion retargeting of
humans to humanoid robots, and policy learning for real-world deployment. With
Humanoid-X, we further train a large humanoid model, UH-1, which takes text
instructions as input and outputs corresponding actions to control a humanoid
robot. Extensive simulated and real-world experiments validate that our
scalable training approach leads to superior generalization in text-based
humanoid control, marking a significant step toward adaptable, real-world-ready
humanoid robots.
☆ Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Humans possess the visual-spatial intelligence to remember spaces from
sequential visual observations. However, can Multimodal Large Language Models
(MLLMs) trained on million-scale video datasets also ``think in space'' from
videos? We present a novel video-based visual-spatial intelligence benchmark
(VSI-Bench) of over 5,000 question-answer pairs, and find that MLLMs exhibit
competitive - though subhuman - visual-spatial intelligence. We probe models to
express how they think in space both linguistically and visually and find that
while spatial reasoning capabilities remain the primary bottleneck for MLLMs to
reach higher benchmark performance, local world models and spatial awareness do
emerge within these models. Notably, prevailing linguistic reasoning techniques
(e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve
performance, whereas explicitly generating cognitive maps during
question-answering enhances MLLMs' spatial distance ability.
comment: Project page:
https://vision-x-nyu.github.io/thinking-in-space.github.io/
☆ Autoregressive Video Generation without Vector Quantization
Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang
This paper presents a novel approach that enables autoregressive video
generation with high efficiency. We propose to reformulate the video generation
problem as a non-quantized autoregressive modeling of temporal frame-by-frame
prediction and spatial set-by-set prediction. Unlike raster-scan prediction in
prior autoregressive models or joint distribution modeling of fixed-length
tokens in diffusion models, our approach maintains the causal property of
GPT-style models for flexible in-context capabilities, while leveraging
bidirectional modeling within individual frames for efficiency. With the
proposed approach, we train a novel video autoregressive model without vector
quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior
autoregressive video models in data efficiency, inference speed, visual
fidelity, and video fluency, even with a much smaller model capacity, i.e.,
0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models
in text-to-image generation tasks, with a significantly lower training cost.
Additionally, NOVA generalizes well across extended video durations and enables
diverse zero-shot applications in one unified model. Code and models are
publicly available at https://github.com/baaivision/NOVA.
comment: 22 pages, 16 figures
☆ E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling
Zhihang Yuan, Yuzhang Shang, Hanling Zhang, Tongcheng Fang, Rui Xie, Bingxin Xu, Yan Yan, Shengen Yan, Guohao Dai, Yu Wang
Recent advances in autoregressive (AR) models with continuous tokens for
image generation show promising results by eliminating the need for discrete
tokenization. However, these models face efficiency challenges due to their
sequential token generation nature and reliance on computationally intensive
diffusion-based sampling. We present ECAR (Efficient Continuous Auto-Regressive
Image Generation via Multistage Modeling), an approach that addresses these
limitations through two intertwined innovations: (1) a stage-wise continuous
token generation strategy that reduces computational complexity and provides
progressively refined token maps as hierarchical conditions, and (2) a
multistage flow-based distribution modeling method that transforms only
partially denoised distributions at each stage, compared to the complete
denoising in standard diffusion models. Holistically, ECAR operates by generating tokens at
increasing resolutions while simultaneously denoising the image at each stage.
This design not only reduces token-to-image transformation cost by a factor of
the stage number but also enables parallel processing at the token level. Our
approach not only enhances computational efficiency but also aligns naturally
with image generation principles by operating in continuous token space and
following a hierarchical generation process from coarse to fine details.
Experimental results demonstrate that ECAR achieves image quality comparable to
DiT (Peebles & Xie, 2023) while requiring 10$\times$ fewer FLOPs and achieving a
5$\times$ speedup when generating a 256$\times$256 image.
☆ FashionComposer: Compositional Fashion Image Generation
We present FashionComposer for compositional fashion image generation. Unlike
previous methods, FashionComposer is highly flexible. It takes multi-modal
input (i.e., text prompt, parametric human model, garment image, and face
image) and supports personalizing the appearance, pose, and figure of the human
and assigning multiple garments in one pass. To achieve this, we first develop
a universal framework capable of handling diverse input modalities. We
construct scaled training data to enhance the model's robust compositional
capabilities. To accommodate multiple reference images (garments and faces)
seamlessly, we organize these references in a single image as an "asset
library" and employ a reference UNet to extract appearance features. To inject
the appearance features into the correct pixels in the generated result, we
propose subject-binding attention. It binds the appearance features from
different "assets" with the corresponding text features. In this way, the model
could understand each asset according to their semantics, supporting arbitrary
numbers and types of reference images. As a comprehensive solution,
FashionComposer also supports many other applications like human album
generation, diverse virtual try-on tasks, etc.
comment: https://sihuiji.github.io/FashionComposer-Page
☆ VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Recent progress in generative diffusion models has greatly advanced
text-to-video generation. While text-to-video models trained on large-scale,
diverse datasets can produce varied outputs, these generations often deviate
from user preferences, highlighting the need for preference alignment on
pre-trained models. Although Direct Preference Optimization (DPO) has
demonstrated significant improvements in language and image generation, we
pioneer its adaptation to video diffusion models and propose a VideoDPO
pipeline by making several key adjustments. Unlike previous image alignment
methods that focus solely on either (i) visual quality or (ii) semantic
alignment between text and videos, we comprehensively consider both dimensions
and construct a preference score accordingly, which we term the OmniScore. We
design a pipeline to automatically collect preference pair data based on the
proposed OmniScore and discover that re-weighting these pairs based on the
score significantly impacts overall preference alignment. Our experiments
demonstrate substantial improvements in both visual quality and semantic
alignment, ensuring that no preference aspect is neglected. Code and data will
be shared at https://videodpo.github.io/.
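To make the pair re-weighting idea concrete, the sketch below shows one plausible way to scale a DPO-style logistic loss by the gap between OmniScore-like preference scores of the chosen and rejected videos. It is a schematic illustration under assumed inputs (precomputed implicit rewards and scores), not the paper's actual pipeline; the weighting scheme is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(reward_win, reward_lose, score_win, score_lose,
                      beta: float = 0.1):
    """Schematic weighted DPO objective.

    reward_win / reward_lose: implicit rewards of the preferred / rejected video
        (e.g., policy log-likelihood minus reference log-likelihood), shape (B,).
    score_win / score_lose: OmniScore-style preference scores, used here only to
        re-weight pairs (illustrative), shape (B,).
    """
    gap = (score_win - score_lose).clamp_min(0.0)
    weights = gap / (gap.mean() + 1e-8)   # larger preference gaps get larger weight (assumed)
    losses = -F.logsigmoid(beta * (reward_win - reward_lose))
    return (weights * losses).mean()
```

In a diffusion setting, the implicit rewards would typically be derived from denoising losses under the policy and a frozen reference model.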
☆ MegaSynth: Scaling Up 3D Scene Reconstruction with Synthesized Data
Hanwen Jiang, Zexiang Xu, Desai Xie, Ziwen Chen, Haian Jin, Fujun Luan, Zhixin Shu, Kai Zhang, Sai Bi, Xin Sun, Jiuxiang Gu, Qixing Huang, Georgios Pavlakos, Hao Tan
We propose scaling up 3D scene reconstruction by training with synthesized
data. At the core of our work is MegaSynth, a procedurally generated 3D dataset
comprising 700K scenes - over 50 times larger than the prior real dataset DL3DV
- dramatically scaling the training data. To enable scalable data generation,
our key idea is eliminating semantic information, removing the need to model
complex semantic priors such as object affordances and scene composition.
Instead, we model scenes with basic spatial structures and geometry primitives,
offering scalability. Besides, we control data complexity to facilitate
training while loosely aligning it with real-world data distribution to benefit
real-world generalization. We explore training LRMs with both MegaSynth and
available real data. Experiment results show that joint training or
pre-training with MegaSynth improves reconstruction quality by 1.2 to 1.8 dB
PSNR across diverse image domains. Moreover, models trained solely on MegaSynth
perform comparably to those trained on real data, underscoring the low-level
nature of 3D reconstruction. Additionally, we provide an in-depth analysis of
MegaSynth's properties for enhancing model capability, training stability, and
generalization.
comment: Project page: https://hwjiang1510.github.io/MegaSynth/
★ MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu
In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a
simple and effective extension to visual instruction tuning that enables a
pretrained LLM to quickly morph into a unified autoregressive model capable of
generating both text and visual tokens. VPiT teaches an LLM to predict discrete
text tokens and continuous visual tokens from any input sequence of image and
text data curated in an instruction-following format. Our empirical
investigation reveals several intriguing properties of VPiT: (1) visual
generation ability emerges as a natural byproduct of improved visual
understanding, and can be unlocked efficiently with a small amount of
generation data; (2) while we find understanding and generation to be mutually
beneficial, understanding data contributes to both capabilities more
effectively than generation data. Building upon these findings, we train our
MetaMorph model and achieve competitive performance on both visual
understanding and generation. In visual generation, MetaMorph can leverage the
world knowledge and reasoning abilities gained from LLM pretraining, and
overcome common failure modes exhibited by other generation models. Our results
suggest that LLMs may have strong "prior" vision capabilities that can be
efficiently adapted to both visual understanding and generation with a
relatively simple instruction tuning process.
comment: Project page at tsb0601.github.io/metamorph
☆ AKiRa: Augmentation Kit on Rays for optical video generation
Recent advances in text-conditioned video diffusion have greatly improved
video quality. However, these methods offer limited or sometimes no control to
users over camera aspects, including dynamic camera motion, zoom, lens
distortion, and focus shifts. These motion and optical aspects are crucial for adding
controllability and cinematic elements to generation frameworks, ultimately
resulting in visual content that draws focus, enhances mood, and guides
emotions according to filmmakers' controls. In this paper, we aim to close the
gap between controllable video generation and camera optics. To achieve this,
we propose AKiRa (Augmentation Kit on Rays), a novel augmentation framework
that builds and trains a camera adapter with a complex camera model over an
existing video generation backbone. It enables fine-tuned control over camera
motion as well as complex optical parameters (focal length, distortion,
aperture) to achieve cinematic effects such as zoom, fisheye effect, and bokeh.
Extensive experiments demonstrate AKiRa's effectiveness in combining and
composing camera optics while outperforming all state-of-the-art methods. This
work sets a new landmark in controlled and optically enhanced video generation,
paving the way for future optical video generation methods.
☆ MCMat: Multiview-Consistent and Physically Accurate PBR Material Generation
Shenhao Zhu, Lingteng Qiu, Xiaodong Gu, Zhengyi Zhao, Chao Xu, Yuxiao He, Zhe Li, Xiaoguang Han, Yao Yao, Xun Cao, Siyu Zhu, Weihao Yuan, Zilong Dong, Hao Zhu
Existing 2D methods utilize UNet-based diffusion models to generate
multi-view physically-based rendering (PBR) maps but struggle with multi-view
inconsistency, while some 3D methods directly generate UV maps, encountering
generalization issues due to the limited 3D data. To address these problems, we
propose a two-stage approach, including multi-view generation and UV materials
refinement. In the generation stage, we adopt a Diffusion Transformer (DiT)
model to generate PBR materials, where both the specially designed multi-branch
DiT and reference-based DiT blocks adopt a global attention mechanism to
promote feature interaction and fusion between different views, thereby
improving multi-view consistency. In addition, we adopt a PBR-based diffusion
loss to ensure that the generated materials align with realistic physical
principles. In the refinement stage, we propose a material-refined DiT that
performs inpainting in empty areas and enhances details in UV space. In addition
to the normal condition, this refinement also takes the material map from the
generation stage as an additional condition to reduce the learning difficulty
and improve generalization. Extensive experiments show that our method achieves
state-of-the-art performance in texturing 3D objects with PBR materials and
provides significant advantages for graphics relighting applications. Project
Page: https://lingtengqiu.github.io/2024/MCMat/
comment: Project Page: https://lingtengqiu.github.io/2024/MCMat/
☆ Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
Visual understanding is often approached at three levels of granularity: image,
patch, and pixel. Visual tokenization, trained by self-supervised reconstructive
learning, compresses visual data with a patch-level codebook at marginal
information loss, but the resulting visual tokens lack semantic meaning. Open-vocabulary
semantic segmentation benefits from evolving Vision-Language Models
(VLMs) with strong zero-shot image capability, but transferring
image-level to pixel-level understanding remains a pressing challenge. In this
paper, we treat segmentation as tokenizing pixels and study unified perceptual
and semantic token compression for understanding at all levels of granularity,
thereby facilitating open-vocabulary semantic segmentation. Mirroring the cognitive
process of pretrained VLMs, in which low-level features are progressively
composed into high-level semantics, we propose Feature Pyramid Tokenization (PAT)
to cluster and represent multi-resolution features with learnable codebooks and
then decode them by jointly learning pixel reconstruction and semantic
segmentation. We design loosely coupled pixel and semantic learning branches.
The pixel branch simulates bottom-up composition and top-down visualization of
codebook tokens, while the semantic branch collectively fuses hierarchical
codebooks as auxiliary segmentation guidance. Our experiments show that PAT
enhances the semantic intuition of the VLM feature pyramid, improves performance
over the baseline segmentation model, and achieves competitive performance on
open-vocabulary semantic segmentation benchmarks. Our model is
parameter-efficient for VLM integration and flexible for independent
tokenization. We hope this work provides inspiration not only for improving
segmentation but also for semantic visual token utilization.
comment: 6 pages, 6 figures
☆ AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities
Geospatial models must adapt to the diversity of Earth observation data in
terms of resolutions, scales, and modalities. However, existing approaches
expect fixed input configurations, which limits their practical applicability.
We propose AnySat, a multimodal model based on joint embedding predictive
architecture (JEPA) and resolution-adaptive spatial encoders, allowing us to
train a single model on highly heterogeneous data in a self-supervised manner.
To demonstrate the advantages of this unified approach, we compile GeoPlex, a
collection of $5$ multimodal datasets with varying characteristics and $11$
distinct sensors. We then train a single powerful model on these diverse
datasets simultaneously. After fine-tuning, our model achieves better or near
state-of-the-art results on the datasets of GeoPlex and $4$ additional ones for
$5$ environment monitoring tasks: land cover mapping, tree species
identification, crop type classification, change detection, and flood
segmentation. The code and models are available at
https://github.com/gastruc/AnySat.
☆ GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images
The rapid and accurate direct multi-frame interpolation method for Digital
Subtraction Angiography (DSA) images is crucial for reducing radiation and
providing real-time assistance to physicians for precise diagnostics and
treatment. DSA images contain complex vascular structures and various motions.
Applying natural scene Video Frame Interpolation (VFI) methods results in
motion artifacts, structural dissipation, and blurriness. Recently, MoSt-DSA
has specifically addressed these issues for the first time and achieved SOTA
results. However, MoSt-DSA's focus on real-time performance leads to
insufficient suppression of high-frequency noise and incomplete filtering of
low-frequency noise in the generated images. To address these issues within the
same computational time scale, we propose GaraMoSt. Specifically, we optimize
the network pipeline with a parallel design and propose a module named MG-MSFE.
MG-MSFE extracts frame-relative motion and structural features at various
granularities in a fully convolutional parallel manner and supports
independent, flexible adjustment of context-aware granularity at different
scales, thus enhancing computational efficiency and accuracy. Extensive
experiments demonstrate that GaraMoSt achieves the SOTA performance in
accuracy, robustness, visual effects, and noise suppression, comprehensively
surpassing MoSt-DSA and other natural scene VFI methods. The code and models
are available at https://github.com/ZyoungXu/GaraMoSt.
comment: Accepted by AAAI2025
☆ Event-based Photometric Bundle Adjustment
We tackle the problem of bundle adjustment (i.e., simultaneous refinement of
camera poses and scene map) for a purely rotating event camera. Starting from
first principles, we formulate the problem as a classical non-linear least
squares optimization. The photometric error is defined using the event
generation model, directly in terms of the camera rotations and the semi-dense
scene brightness that triggers the events. We leverage the sparsity of event data to
design a tractable Levenberg-Marquardt solver that handles the very large
number of variables involved. To the best of our knowledge, our method, which
we call Event-based Photometric Bundle Adjustment (EPBA), is the first
event-only photometric bundle adjustment method that works on the brightness
map directly and exploits the space-time characteristics of event data, without
having to convert events into image-like representations. Comprehensive
experiments on both synthetic and real-world datasets demonstrate EPBA's
effectiveness in decreasing the photometric error (by up to 90%), yielding
results of unparalleled quality. The refined maps reveal details that were
hidden using prior state-of-the-art rotation-only estimation methods. The
experiments on modern high-resolution event cameras show the applicability of
EPBA to panoramic imaging in various scenarios (without map initialization, at
multiple resolutions, and in combination with other methods, such as IMU dead
reckoning or previous event-based rotation estimation methods). We make the
source code publicly available. https://github.com/tub-rip/epba
comment: 21 pages, 19 figures, 10 tables. Project page:
https://github.com/tub-rip/epba
☆ Foundation Models Meet Low-Cost Sensors: Test-Time Adaptation for Rescaling Disparity for Zero-Shot Metric Depth Estimation
The recent development of foundation models for monocular depth estimation
such as Depth Anything paved the way to zero-shot monocular depth estimation.
Since such models return an affine-invariant disparity map, the favored technique
to recover metric depth is to fine-tune the model. However, this stage is costly,
not only because of the training itself but also because of the dataset creation:
it must contain images captured by the camera that will be used at test time,
together with the corresponding ground truth. Moreover, fine-tuning may also
degrade the generalization capacity of the original model. Instead, we propose in
this paper a new method to rescale Depth Anything predictions using 3D points
provided by low-cost sensors or techniques such as a low-resolution LiDAR, a
stereo camera, or structure-from-motion with poses given by an IMU.
Thus, this approach avoids fine-tuning and preserves the generalizing power of
the original depth estimation model while being robust to the noise of the
sensor or of the depth model. Our experiments highlight improvements relative
to other metric depth estimation methods and competitive results compared to
fine-tuned approaches. Code available at
https://gitlab.ensta.fr/ssh/monocular-depth-rescaling.
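The rescaling step described above can be illustrated with a minimal global scale-and-shift fit: sample the model's affine-invariant disparity at the pixels where sparse metric depth is available, solve a two-parameter least-squares problem, and invert back to depth. This is a generic sketch of such an alignment with placeholder inputs, not the authors' complete method (which is also designed to be robust to sensor and model noise).

```python
import numpy as np

def rescale_disparity(disparity, sparse_depth, sparse_uv, eps=1e-6):
    """Fit scale/shift so that scale * d + shift ~ 1 / metric_depth at sparse points.

    disparity   : (H, W) affine-invariant disparity from a monocular model.
    sparse_depth: (N,) metric depths from a low-cost sensor (illustrative input).
    sparse_uv   : (N, 2) integer pixel coordinates (u, v) of those depths.
    Returns a metric depth map of shape (H, W).
    """
    d = disparity[sparse_uv[:, 1], sparse_uv[:, 0]]      # predicted disparity at samples
    target = 1.0 / np.maximum(sparse_depth, eps)          # metric disparity targets
    A = np.stack([d, np.ones_like(d)], axis=1)            # design matrix for [scale, shift]
    (scale, shift), *_ = np.linalg.lstsq(A, target, rcond=None)
    metric_disparity = np.maximum(scale * disparity + shift, eps)
    return 1.0 / metric_disparity
```

A robust fit (e.g., RANSAC or an L1 loss) would be a natural extension to down-weight noisy sensor points.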
☆ Parameter-efficient Fine-tuning for improved Convolutional Baseline for Brain Tumor Segmentation in Sub-Saharan Africa Adult Glioma Dataset
Bijay Adhikari, Pratibha Kulung, Jakesh Bohaju, Laxmi Kanta Poudel, Confidence Raymond, Dong Zhang, Udunna C Anazodo, Bishesh Khanal, Mahesh Shakya
Automating brain tumor segmentation using deep learning methods is an ongoing
challenge in medical imaging. Multiple lingering issues exist, including domain
shift and application in low-resource settings, which bring a unique set of
challenges, including data scarcity. As a step towards solving these specific
problems, we propose Convolutional adapter-inspired Parameter-efficient
Fine-tuning (PEFT) of the MedNeXt architecture. To validate our idea, we show
that our method performs comparably to full fine-tuning, with the added benefit
of reduced training compute, using BraTS-2021 as the pre-training dataset and
BraTS-Africa as the fine-tuning dataset. BraTS-Africa consists of a small
dataset (60 train / 35 validation) from the Sub-Saharan African population with
a marked shift in MRI quality compared to BraTS-2021 (1251 training samples). We
first show that models trained on the BraTS-2021 dataset do not generalize well
to BraTS-Africa, as shown by a 20% reduction in mean Dice on BraTS-Africa
validation samples. Then, we show that PEFT can leverage both the BraTS-2021 and
BraTS-Africa datasets to obtain a mean Dice of 0.80, compared to 0.72 when
trained only on BraTS-Africa. Finally, we show that PEFT (0.80 mean Dice)
achieves performance comparable to full fine-tuning (0.77 mean Dice), which may
suggest that PEFT is better on average, although the boxplots show that full
fine-tuning results have much lower variance in performance. Nevertheless, on
disaggregating the Dice metrics, we find that the model has a tendency to
oversegment, as shown by its high specificity (0.99) compared to relatively low
sensitivity (0.75). The
source code is available at
https://github.com/CAMERA-MRI/SPARK2024/tree/main/PEFT_MedNeXt
comment: Accepted to "The International Brain Tumor Segmentation (BraTS)
challenge organized at MICCAI 2024 conference"
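As a rough illustration of convolutional adapter-style PEFT (generic, not the exact MedNeXt integration described above), one can freeze the pretrained backbone and train only small bottleneck adapter blocks inserted after each stage; the PyTorch sketch below uses hypothetical channel sizes.

```python
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Bottleneck 3D convolutional adapter with a residual connection (illustrative)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.down = nn.Conv3d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.up = nn.Conv3d(hidden, channels, kernel_size=1)
        nn.init.zeros_(self.up.weight)   # start as identity so pretrained behavior is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def add_adapters_and_freeze(backbone: nn.Module, stage_channels):
    """Freeze all backbone weights; only the adapters (and, typically, the
    segmentation head) remain trainable."""
    for p in backbone.parameters():
        p.requires_grad = False
    return nn.ModuleList([ConvAdapter(c) for c in stage_channels])

# Example (hypothetical channel sizes):
# adapters = add_adapters_and_freeze(backbone, stage_channels=[32, 64, 128, 256])
```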
☆ Adaptive Concept Bottleneck for Foundation Models Under Distribution Shifts ICML 2024
Advancements in foundation models (FMs) have led to a paradigm shift in
machine learning. The rich, expressive feature representations from these
pre-trained, large-scale FMs are leveraged for multiple downstream tasks,
usually via lightweight fine-tuning of a shallow fully-connected network
following the representation. However, the non-interpretable, black-box nature
of this prediction pipeline can be a challenge, especially in critical domains
such as healthcare, finance, and security. In this paper, we explore the
potential of Concept Bottleneck Models (CBMs) for transforming complex,
non-interpretable foundation models into interpretable decision-making
pipelines using high-level concept vectors. Specifically, we focus on the
test-time deployment of such an interpretable CBM pipeline "in the wild", where
the input distribution often shifts from the original training distribution. We
first identify the potential failure modes of such a pipeline under different
types of distribution shifts. Then we propose an adaptive concept bottleneck
framework that addresses these failure modes by dynamically adapting the
concept-vector bank and the prediction layer based solely on unlabeled data
from the target domain, without access to the source (training) dataset.
Empirical evaluations with various real-world distribution shifts show that our
adaptation method produces concept-based interpretations better aligned with
the test data and boosts post-deployment accuracy by up to 28%, aligning the
CBM performance with that of non-interpretable classification.
comment: The preliminary version of the work appeared in the ICML 2024
Workshop on Foundation Models in the Wild
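A generic concept-bottleneck prediction head, together with a simple unsupervised test-time adaptation step (entropy minimization on unlabeled target batches), can be sketched as follows. This only conveys the structure being adapted (a concept-vector bank plus a prediction layer, updated without source data); the authors' actual adaptation objective may differ, and all tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def concept_activations(features, concept_bank):
    """Project frozen foundation-model features onto a bank of concept vectors."""
    f = F.normalize(features, dim=-1)        # (B, D) frozen FM features
    c = F.normalize(concept_bank, dim=-1)    # (K, D) learnable concept vectors
    return f @ c.t()                         # (B, K) concept activations

def adapt_on_unlabeled_batch(features, concept_bank, head, optimizer):
    """One entropy-minimization step that updates only the concept bank and head."""
    acts = concept_activations(features, concept_bank)
    probs = F.softmax(head(acts), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()
```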
☆ Joint Perception and Prediction for Autonomous Driving: A Survey
Perception and prediction modules are critical components of autonomous
driving systems, enabling vehicles to navigate safely through complex
environments. The perception module is responsible for perceiving the
environment, including static and dynamic objects, while the prediction module
is responsible for predicting the future behavior of these objects. These
modules are typically divided into three tasks: object detection, object
tracking, and motion prediction. Traditionally, these tasks are developed and
optimized independently, with outputs passed sequentially from one to the next.
However, this approach has significant limitations: computational resources are
not shared across tasks, the lack of joint optimization can amplify errors as
they propagate throughout the pipeline, and uncertainty is rarely propagated
between modules, resulting in significant information loss. To address these
challenges, the joint perception and prediction paradigm has emerged,
integrating perception and prediction into a unified model through multi-task
learning. This strategy not only overcomes the limitations of previous methods,
but also enables the three tasks to have direct access to raw sensor data,
allowing richer and more nuanced environmental interpretations. This paper
presents the first comprehensive survey of joint perception and prediction for
autonomous driving. We propose a taxonomy that categorizes approaches based on
input representation, scene context modeling, and output representation,
highlighting their contributions and limitations. Additionally, we present a
qualitative analysis and quantitative comparison of existing methods. Finally,
we discuss future research directions based on identified gaps in the
state-of-the-art.
comment: 24 pages, 5 sections, 7 figures, 7 tables. This work has been
submitted to the IEEE Transactions on Intelligent Transportation Systems for
possible publication
☆ Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models
Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, Huaping Liu
Foundation Vision Language Models (VLMs) exhibit strong capabilities in
multi-modal representation learning, comprehension, and reasoning. By injecting
action components into the VLMs, Vision-Language-Action Models (VLAs) can be
naturally formed and also show promising performance. Existing work has
demonstrated the effectiveness and generalization of VLAs in multiple scenarios
and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial since
existing VLAs differ in their backbones, action-prediction formulations, data
distributions, and training recipes. This leads to a missing piece for a
systematic understanding of the design choices of VLAs. In this work, we
disclose the key factors that significantly influence the performance of VLA
and focus on answering three essential design choices: which backbone to
select, how to formulate the VLA architectures, and when to add
cross-embodiment data. The obtained results firmly convince us of why we need
VLAs and guide the development of a new family of VLAs, RoboVLMs, which requires
very few manual design choices and achieves new state-of-the-art performance in
three simulation tasks and real-world experiments. Through our extensive experiments,
which include over 8 VLM backbones, 4 policy architectures, and over 600
distinct designed experiments, we provide a detailed guidebook for the future
design of VLAs. In addition to the study, the highly flexible RoboVLMs
framework, which supports easy integrations of new VLMs and free combinations
of various design choices, is made public to facilitate future research. We
open-source all details, including codes, models, datasets, and toolkits, along
with detailed training and evaluation recipes at: robovlms.github.io.
comment: Project page: robovlms.github.io
☆ A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future
Artificial intelligence (AI) has rapidly developed through advancements in
computational power and the growth of massive datasets. However, this progress
has also heightened challenges in interpreting the "black-box" nature of AI
models. To address these concerns, eXplainable AI (XAI) has emerged with a
focus on transparency and interpretability to enhance human understanding and
trust in AI decision-making processes. In the context of multimodal data fusion
and complex reasoning scenarios, the proposal of Multimodal eXplainable AI
(MXAI) integrates multiple modalities for prediction and explanation tasks.
Meanwhile, the advent of Large Language Models (LLMs) has led to remarkable
breakthroughs in natural language processing, yet their complexity has further
exacerbated the challenges of MXAI. To gain key insights into the development of
MXAI methods and provide crucial guidance for building more transparent, fair,
and trustworthy AI systems, we review the MXAI methods from a historical
perspective and categorize them across four eras: traditional machine learning,
deep learning, discriminative foundation models, and generative LLMs. We also
review evaluation metrics and datasets used in MXAI research, concluding with a
discussion of future challenges and directions. A project related to this
review has been created at https://github.com/ShilinSun/mxai_review.
comment: This work has been submitted to the IEEE for possible publication
☆ CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Computer-Aided Design (CAD) models are typically constructed by sequentially
drawing parametric sketches and applying CAD operations to obtain a 3D model.
The problem of 3D CAD reverse engineering consists of reconstructing the sketch
and CAD operation sequences from 3D representations such as point clouds. In
this paper, we address this challenge through novel contributions across three
levels: CAD sequence representation, network design, and dataset. In
particular, we represent CAD sketch-extrude sequences as Python code. The
proposed CAD-Recode translates a point cloud into Python code that, when
executed, reconstructs the CAD model. Taking advantage of the exposure of
pre-trained Large Language Models (LLMs) to Python code, we leverage a
relatively small LLM as a decoder for CAD-Recode and combine it with a
lightweight point cloud projector. CAD-Recode is trained solely on a proposed
synthetic dataset of one million diverse CAD sequences. CAD-Recode
significantly outperforms existing methods across three datasets while
requiring fewer input points. Notably, it achieves 10 times lower mean Chamfer
distance than state-of-the-art methods on DeepCAD and Fusion360 datasets.
Furthermore, we show that our CAD Python code output is interpretable by
off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering
from point clouds.
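For intuition, a sketch-extrude sequence expressed as executable Python could look like the minimal CadQuery snippet below, a plausible form of the code-as-CAD representation; the specific library, dimensions, and code style that CAD-Recode emits are assumptions here.

```python
import cadquery as cq

# Sketch-extrude sequence: a rectangular base plate with a through-hole.
model = (
    cq.Workplane("XY")
    .rect(40.0, 20.0)          # 2D sketch: outer rectangle (mm)
    .extrude(10.0)             # CAD operation: extrude the sketch into a solid
    .faces(">Z").workplane()   # select the top face for the next feature
    .hole(6.0)                 # drill a 6 mm through-hole
)
```

Executing such a snippet reconstructs a 3D model, which is what allows the predicted code to be compared directly against the input point cloud.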
☆ SurgSora: Decoupled RGBD-Flow Diffusion Model for Controllable Surgical Video Generation
Medical video generation has transformative potential for enhancing surgical
understanding and pathology insights through precise and controllable visual
representations. However, current models face limitations in controllability
and authenticity. To bridge this gap, we propose SurgSora, a
motion-controllable surgical video generation framework that uses a single
input frame and user-controllable motion cues. SurgSora consists of three key
modules: the Dual Semantic Injector (DSI), which extracts object-relevant RGB
and depth features from the input frame and integrates them with segmentation
cues to capture detailed spatial features of complex anatomical structures; the
Decoupled Flow Mapper (DFM), which fuses optical flow with semantic-RGB-D
features at multiple scales to enhance temporal understanding and object
spatial dynamics; and the Trajectory Controller (TC), which allows users to
specify motion directions and estimates sparse optical flow, guiding the video
generation process. The fused features are used as conditions for a frozen
Stable Diffusion model to produce realistic, temporally coherent surgical
videos. Extensive evaluations demonstrate that SurgSora outperforms
state-of-the-art methods in controllability and authenticity, showing its
potential to advance surgical video generation for medical education, training,
and research.
☆ Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang
Prompts play a critical role in unleashing the power of language and vision
foundation models for specific tasks. For the first time, we introduce
prompting into depth foundation models, creating a new paradigm for metric
depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost
LiDAR as the prompt to guide the Depth Anything model for accurate metric depth
output, achieving up to 4K resolution. Our approach centers on a concise prompt
fusion design that integrates the LiDAR at multiple scales within the depth
decoder. To address training challenges posed by limited datasets containing
both LiDAR depth and precise GT depth, we propose a scalable data pipeline that
includes synthetic data LiDAR simulation and real data pseudo GT depth
generation. Our approach sets new state-of-the-arts on the ARKitScenes and
ScanNet++ datasets and benefits downstream applications, including 3D
reconstruction and generalized robotic grasping.
comment: Project page: https://PromptDA.github.io/
☆ InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal
segmentation models for the image and video domains have made rapid progress
recently. However, these methods are often developed separately for specific
domains, overlooking the similarities in task settings and solutions across
these two areas. In this paper, we define the union of referring segmentation
and reasoning segmentation at both the image and video levels as Instructed
Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an
end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we
employ an object-aware video perceiver to extract temporal and object
information from reference frames, facilitating comprehensive video
understanding. Additionally, we introduce vision-guided multi-granularity text
fusion to better integrate global and detailed text information with
fine-grained visual guidance. By leveraging multi-task and end-to-end training,
InstructSeg demonstrates superior performance across diverse image and video
segmentation tasks, surpassing both segmentation specialists and MLLM-based
methods with a single model. Our code is available at
https://github.com/congvvc/InstructSeg.
☆ Real-Time Position-Aware View Synthesis from Single-View Input
Recent advancements in view synthesis have significantly enhanced immersive
experiences across various computer graphics and multimedia applications,
including telepresence and entertainment. By enabling the generation of new
perspectives from a single input view, view synthesis allows users to better
perceive and interact with their environment. However, many state-of-the-art
methods, while achieving high visual quality, face limitations in real-time
performance, which makes them less suitable for live applications where low
latency is critical. In this paper, we present a lightweight, position-aware
network designed for real-time view synthesis from a single input image and a
target camera pose. The proposed framework consists of a Position Aware
Embedding, modeled with a multi-layer perceptron, which efficiently maps
positional information from the target pose to generate high-dimensional
feature maps. These feature maps, along with the input image, are fed into a
Rendering Network that merges features from dual encoder branches to resolve
both high-level semantics and low-level details, producing a realistic new view
of the scene. Experimental results demonstrate that our method achieves
superior efficiency and visual quality compared to existing approaches,
particularly in handling complex translational movements without explicit
geometric operations like warping. This work marks a step toward enabling
real-time view synthesis from a single image for live and interactive
applications.
☆ GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians
Rendering photorealistic head avatars from arbitrary viewpoints is crucial
for various applications like virtual reality. Although previous methods based
on Neural Radiance Fields (NeRF) can achieve impressive results, they lack
fidelity and efficiency. Recent methods using 3D Gaussian Splatting (3DGS) have
improved rendering quality and real-time performance but still require
significant storage overhead. In this paper, we introduce a method called
GraphAvatar that utilizes Graph Neural Networks (GNN) to generate 3D Gaussians
for the head avatar. Specifically, GraphAvatar trains a geometric GNN and an
appearance GNN to generate the attributes of the 3D Gaussians from the tracked
mesh. Therefore, our method can store the GNN models instead of the 3D
Gaussians, significantly reducing the storage overhead to just 10MB. To reduce
the impact of face-tracking errors, we also present a novel graph-guided
optimization module to refine face-tracking parameters during training.
Finally, we introduce a 3D-aware enhancer for post-processing to enhance the
rendering quality. We conduct comprehensive experiments to demonstrate the
advantages of GraphAvatar, surpassing existing methods in visual fidelity and
storage consumption. The ablation study sheds light on the trade-offs between
rendering quality and model size. The code will be released at:
https://github.com/ucwxb/GraphAvatar
comment: accepted by AAAI2025
☆ Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
Jinghan He, Kuan Zhu, Haiyun Guo, Junfeng Fang, Zhenglin Hua, Yuheng Jia, Ming Tang, Tat-Seng Chua, Jinqiao Wang
Large vision-language models (LVLMs) have made substantial progress in
integrating large language models (LLMs) with visual inputs, enabling advanced
multimodal reasoning. Despite their success, a persistent challenge is
hallucination-where generated text fails to accurately reflect visual
content-undermining both accuracy and reliability. Existing methods focus on
alignment training or decoding refinements but primarily address symptoms at
the generation stage without probing the underlying causes. In this work, we
investigate the internal mechanisms driving hallucination in LVLMs, with an
emphasis on the multi-head attention module. Specifically, we introduce
Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of
attention head outputs to visual context. Based on this, our findings reveal
the presence of vision-aware attention heads that are more attuned to visual
information; however, the model's overreliance on its prior language patterns
is closely related to hallucinations. Building on these insights, we propose
Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate
hallucination by enhancing the role of vision-aware attention heads. Extensive
experiments demonstrate that our method achieves superior performance compared
to state-of-the-art approaches in mitigating hallucinations, while maintaining
high efficiency with negligible additional time overhead.
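A rough, schematic way to score per-head visual sensitivity in the spirit of VHD: run the same prompt once with the real image and once with a neutral (e.g., blank) image, cache each attention head's output at the generation position, and compare the two. The tensor shapes and the exact divergence below are assumptions, not the paper's definition.

```python
import torch

def head_visual_sensitivity(head_out_visual: torch.Tensor,
                            head_out_blank: torch.Tensor) -> torch.Tensor:
    """head_out_*: (num_layers, num_heads, hidden_dim) per-head outputs at the
    next-token position, cached under the real image and under a blank image.
    Returns a (num_layers, num_heads) sensitivity score."""
    diff = (head_out_visual - head_out_blank).norm(dim=-1)
    return diff / (head_out_blank.norm(dim=-1) + 1e-8)

# Heads with high scores are more attuned to visual context; a training-free
# mitigation in the spirit of VHR could up-weight their contribution at decoding time.
```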
☆ Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition
In this study, we define and tackle zero-shot "real" classification by
description, a novel task that evaluates the ability of Vision-Language Models
(VLMs) like CLIP to classify objects based solely on descriptive attributes,
excluding object class names. This approach highlights the current limitations
of VLMs in understanding intricate object descriptions, pushing these models
beyond mere object recognition. To facilitate this exploration, we introduce a
new challenge and release description data for six popular fine-grained
benchmarks, which omit object names to encourage genuine zero-shot learning
within the research community. Additionally, we propose a method to enhance
CLIP's attribute detection capabilities through targeted training using
ImageNet21k's diverse object categories, paired with rich attribute
descriptions generated by large language models. Furthermore, we introduce a
modified CLIP architecture that leverages multiple resolutions to improve the
detection of fine-grained part attributes. Through these efforts, we broaden
the understanding of part-attribute recognition in CLIP, improving its
performance in fine-grained classification tasks across six popular benchmarks,
as well as in the PACO dataset, a widely used benchmark for object-attribute
recognition. Code is available at:
https://github.com/ethanbar11/grounding_ge_public.
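The evaluation setting, classifying with attribute descriptions while withholding class names, can be reproduced schematically with off-the-shelf CLIP: encode only name-free descriptions per class and take the arg-max over image-text similarities. Below is a minimal sketch using the openai/CLIP package; the descriptions, file name, and class labels are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Name-free attribute descriptions (placeholders), one per class.
class_descriptions = {
    "class_0": "a small bird with a bright red crown and a short conical beak",
    "class_1": "a small bird with a yellow belly and black wing bars",
}

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(list(class_descriptions.values())).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(texts)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)

pred = list(class_descriptions.keys())[sims.argmax().item()]
```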
☆ On Explaining Knowledge Distillation: Measuring and Visualising the Knowledge Transfer Process
Knowledge distillation (KD) remains challenging due to the opaque nature of
the knowledge transfer process from a Teacher to a Student, making it difficult
to address certain issues related to KD. To address this, we propose UniCAM, a
novel gradient-based visual explanation method, which effectively interprets
the knowledge learned during KD. Our experimental results demonstrate that with
the guidance of the Teacher's knowledge, the Student model becomes more
efficient, learning more relevant features while discarding those that are not
relevant. We refer to the features learned with the Teacher's guidance as
distilled features and the features irrelevant to the task and ignored by the
Student as residual features. Distilled features focus on key aspects of the
input, such as textures and parts of objects. In contrast, residual features
demonstrate more diffused attention, often targeting irrelevant areas,
including the backgrounds of the target objects. In addition, we propose two
novel metrics: the feature similarity score (FSS) and the relevance score (RS),
which quantify the relevance of the distilled knowledge. Experiments on the
CIFAR10, ASIRRA, and Plant Disease datasets demonstrate that UniCAM and the two
metrics offer valuable insights to explain the KD process.
comment: Accepted to 2025 IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV'25). Includes 5 pages of supplementary material
☆ Retrieval Augmented Image Harmonization
When embedding objects (foreground) into images (background), considering the
influence of photography conditions like illumination, it is usually necessary
to perform image harmonization to make the foreground object consistent with
the background image in terms of brightness, color, etc. Although existing
image harmonization methods have made continuous efforts toward visually
pleasing results, they are still plagued by two main issues. Firstly, the image
harmonization becomes highly ill-posed when there are no contents similar to
the foreground object in the background, making the harmonization results
unreliable. Secondly, even when similar contents are available, the
harmonization process is often interfered with by irrelevant areas, mainly
attributed to an insufficient understanding of image contents and inaccurate
attention. As a remedy, we present a retrieval-augmented image harmonization
(Raiha) framework, which seeks proper reference images to reduce the
ill-posedness and restricts the attention to better utilize the useful
information. Specifically, an efficient retrieval method is designed to find
reference images that contain similar objects as the foreground while the
illumination is consistent with the background. For training the Raiha
framework to effectively utilize the reference information, a data augmentation
strategy is delicately designed by leveraging existing non-reference image
harmonization datasets. Besides, the image content priors are introduced to
ensure reasonable attention. With the presented Raiha framework, the image
harmonization performance is greatly boosted under both non-reference and
retrieval-augmented settings. The source code and pre-trained models will be
publicly available.
comment: 8 pages
☆ A Black-Box Evaluation Framework for Semantic Robustness in Bird's Eye View Detection
Camera-based Bird's Eye View (BEV) perception models receive increasing
attention for their crucial role in autonomous driving, a domain where concerns
about the robustness and reliability of deep learning have been raised. While
only a few works have investigated the effects of randomly generated semantic
perturbations, aka natural corruptions, on the multi-view BEV detection task,
we develop a black-box robustness evaluation framework that adversarially
optimises three common semantic perturbations: geometric transformation, colour
shifting, and motion blur, to deceive BEV models, serving as the first approach
in this emerging field. To address the challenge posed by optimising the
semantic perturbation, we design a smoothed, distance-based surrogate function
to replace the mAP metric and introduce SimpleDIRECT, a deterministic
optimisation algorithm that utilises observed slopes to guide the optimisation
process. By comparing with randomised perturbation and two optimisation
baselines, we demonstrate the effectiveness of the proposed framework.
Additionally, we provide a benchmark on the semantic robustness of ten recent
BEV models. The results reveal that PolarFormer, which emphasises geometric
information from multi-view images, exhibits the highest robustness, whereas
BEVDet is fully compromised, with its precision reduced to zero.
☆ Memorizing SAM: 3D Medical Segment Anything Model with Memorizing Transformer
Segment Anything Models (SAMs) have gained increasing attention in medical
image analysis due to their zero-shot generalization capability in segmenting
objects of unseen classes and domains when provided with appropriate user
prompts. Addressing this performance gap is important to fully leverage the
pre-trained weights of SAMs, particularly in the domain of volumetric medical
image segmentation, where accuracy is important but well-annotated 3D medical
data for fine-tuning is limited. In this work, we investigate whether
introducing the memory mechanism as a plug-in, specifically the ability to
memorize and recall internal representations of past inputs, can improve the
performance of SAM with limited computation cost. To this end, we propose
Memorizing SAM, a novel 3D SAM architecture incorporating a memory Transformer
as a plug-in. Unlike conventional memorizing Transformers that save the
internal representation during training or inference, our Memorizing SAM
utilizes existing highly accurate internal representation as the memory source
to ensure the quality of memory. We evaluate the performance of Memorizing SAM
in 33 categories from the TotalSegmentator dataset, which indicates that
Memorizing SAM can outperform the state-of-the-art 3D SAM variant, i.e., FastSAM3D,
with an average Dice increase of 11.36% at the cost of only a 4.38 ms
increase in inference time. The source code is publicly available at
https://github.com/swedfr/memorizingSAM
☆ Data-Efficient Inference of Neural Fluid Fields via SciML Foundation Model
Recent developments in 3D vision have enabled successful progress in
inferring neural fluid fields and realistic rendering of fluid dynamics.
However, these methods require real-world flow captures, which demand dense
video sequences and specialized lab setups, making the process costly and
challenging. Scientific machine learning (SciML) foundation models, which are
pretrained on extensive simulations of partial differential equations (PDEs),
encode rich multiphysics knowledge and thus provide promising sources of domain
priors for inferring fluid fields. Nevertheless, their potential to advance
real-world vision problems remains largely underexplored, raising questions
about the transferability and practical utility of these foundation models. In
this work, we demonstrate that a SciML foundation model can significantly improve
the data efficiency of inferring real-world 3D fluid dynamics with improved
generalization. At the core of our method is leveraging the strong forecasting
capabilities and meaningful representations of SciML foundation models. We
equip neural fluid fields with a novel collaborative training approach that
utilizes augmented views and fluid features extracted by our foundation model.
Our method demonstrates significant improvements in both quantitative metrics
and visual quality, showcasing the practical applicability of SciML foundation
models in real-world fluid dynamics.
☆ Navigating limitations with precision: A fine-grained ensemble approach to wrist pathology recognition on a limited x-ray dataset
The exploration of automated wrist fracture recognition has gained
considerable research attention in recent years. In practical medical
scenarios, physicians and surgeons may lack the specialized expertise required
for accurate X-ray interpretation, highlighting the need for machine vision to
enhance diagnostic accuracy. However, conventional recognition techniques face
challenges in discerning subtle differences in X-rays when classifying wrist
pathologies, as many of these pathologies, such as fractures, can be small and
hard to distinguish. This study tackles wrist pathology recognition as a
fine-grained visual recognition (FGVR) problem, utilizing a limited,
custom-curated dataset that mirrors real-world medical constraints, relying
solely on image-level annotations. We introduce a specialized FGVR-based
ensemble approach to identify discriminative regions within X-rays. We employ
an Explainable AI (XAI) technique called Grad-CAM to pinpoint these regions.
Our ensemble approach outperformed many conventional SOTA and FGVR techniques,
underscoring the effectiveness of our strategy in enhancing accuracy in wrist
pathology recognition.
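Grad-CAM, which the authors use to localize discriminative regions, can be implemented generically with forward/backward hooks on the last convolutional block. The PyTorch sketch below is a standard Grad-CAM implementation; the backbone and target layer in the commented usage are placeholders, not the paper's exact models.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """image: (1, 3, H, W) tensor; returns an (H, W) heatmap in [0, 1]."""
    feats, grads = {}, {}

    def save_act(module, inputs, output):
        feats["a"] = output                       # activations of the target layer

    def save_grad(module, grad_in, grad_out):
        grads["g"] = grad_out[0]                  # gradients w.r.t. those activations

    h1 = target_layer.register_forward_hook(save_act)
    h2 = target_layer.register_full_backward_hook(save_grad)

    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()

    a, g = feats["a"], grads["g"]                 # (1, C, h, w)
    weights = g.mean(dim=(2, 3), keepdim=True)    # global-average-pooled gradients
    cam = F.relu((weights * a).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    h1.remove(); h2.remove()
    return cam[0, 0]

# Example with a placeholder backbone:
# from torchvision.models import resnet50
# m = resnet50(weights="IMAGENET1K_V2").eval()
# heatmap = grad_cam(m, m.layer4[-1], img_tensor, class_idx=282)
```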
☆ Denoising Nearest Neighbor Graph via Continuous CRF for Visual Re-ranking without Fine-tuning
Visual re-ranking using a Nearest Neighbor graph (NN graph) has been adopted to
yield high retrieval accuracy, since it benefits from exploring a
high-dimensional manifold and is applicable without additional fine-tuning. The
quality of visual re-ranking using an NN graph, however, is limited by the quality
of its connectivity, i.e., the edges of the NN graph. Some edges can be misconnected
to negative images. This is known as the noisy edge problem, resulting in a
degradation of the retrieval quality. To address this, we propose a
complementary denoising method based on Continuous Conditional Random Field
(C-CRF) that uses a statistical distance of our similarity-based distribution.
This method employs the concept of cliques to make the process computationally
feasible. We demonstrate the complementarity of our method through its
application to three visual re-ranking methods, observing quality boosts in
landmark retrieval and person re-identification (re-ID).
☆ LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer
Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
In multimodal large language models (MLLMs), vision transformers (ViTs) are
widely employed for visual encoding. However, their performance in solving
universal MLLM tasks is not satisfactory. We attribute it to a lack of
information from diverse visual levels, impeding alignment with the various
semantic granularity required for language generation. To address this issue,
we present LLaVA-UHD v2, an advanced MLLM centered around a Hierarchical window
(Hiwin) transformer that enables capturing diverse visual granularity by constructing
and integrating a high-resolution feature pyramid. As a vision-language
projector, the Hiwin transformer comprises two primary modules: (i) an inverse
feature pyramid, constructed by a ViT-derived feature up-sampling process
utilizing high-frequency details from an image pyramid, and (ii) hierarchical
window attention, focusing on a set of key sampling features within cross-scale
windows to condense multi-level feature maps. Extensive experiments demonstrate
that LLaVA-UHD v2 achieves superior performance over existing MLLMs on popular
benchmarks. Notably, our design brings an average boost of 3.7% across 14
benchmarks compared with the baseline method, 9.3% on DocVQA for instance. We
make all the data, model checkpoints, and code publicly available to facilitate
future research.
☆ Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models
Classifying scanned documents is a challenging problem that involves image,
layout, and text analysis for document understanding. Nevertheless, for certain
benchmark datasets, notably RVL-CDIP, the state of the art is closing in on
near-perfect performance when hundreds of thousands of training samples are
available. With the advent of large language models (LLMs), which are excellent
few-shot learners, the question arises to what extent the document
classification problem can be addressed with only a few training samples, or
even none at all. In this paper, we investigate this question in the context of
zero-shot prompting and few-shot model fine-tuning, with the aim of reducing
the need for human-annotated training samples as much as possible.
comment: ICPR 2024
☆ Diagnosising Helicobacter pylori using AutoEncoders and Limited Annotations through Anomalous Staining Patterns in IHC Whole Slide Images
Purpose: This work addresses the detection of Helicobacter pylori (H. pylori)
in histological images with immunohistochemical staining. This analysis is a
time-demanding task, currently performed by an expert pathologist who visually
inspects the samples. Given the effort required to localise the pathogen in
images, a limited number of annotations might be available in an initial
setting. Our goal is to design an approach that, using a limited set of
annotations, is capable of obtaining results good enough to be used as a
support tool. Methods: We propose to use autoencoders to learn the latent
patterns of healthy patches and formulate a specific measure of the
reconstruction error of the image in HSV space. ROC analysis is used to set the
optimal threshold of this measure and the percentage of positive patches in a
sample that determines the presence of H. pylori. Results: Our method has been
tested on our own database of 245 Whole Slide Images (WSI), with 117 cases
free of H. pylori and varying bacterial density in the remaining ones.
The database has 1211 annotated patches, of which only 163 are positive. This
set of positive annotations was used to train a baseline thresholding method and
an SVM on the features of pre-trained ResNet18 and ViT models. A 10-fold
cross-validation shows that our method has better performance with 91%
accuracy, 86% sensitivity, 96% specificity and 0.97 AUC in the diagnosis of H.
pylori. Conclusion: Unlike classification approaches, our shallow autoencoder
with threshold adaptation for the detection of anomalous staining is able to
achieve competitive results with a limited set of annotated data. This initial
approach is good enough to be used as a guide for fast annotation of infected
patches.
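A minimal sketch of the anomaly-scoring idea described above: reconstruction error is measured in HSV space and the operating threshold is chosen from a ROC curve over annotated patches. The error measure (plain mean absolute error), the Youden's-J threshold choice, and the commented slide-level rule are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np
from skimage.color import rgb2hsv
from sklearn.metrics import roc_curve

def hsv_reconstruction_error(patch_rgb, recon_rgb):
    """Mean absolute reconstruction error computed in HSV space (simplified measure)."""
    return np.abs(rgb2hsv(patch_rgb) - rgb2hsv(recon_rgb)).mean()

def pick_threshold(errors, labels):
    """Choose the operating point from the ROC curve (here: maximizing Youden's J)."""
    fpr, tpr, thresholds = roc_curve(labels, errors)
    return thresholds[np.argmax(tpr - fpr)]

# Hypothetical usage on annotated patches (autoencoder and data not shown):
# errors = [hsv_reconstruction_error(p, autoencoder(p)) for p in patches]
# tau = pick_threshold(errors, patch_labels)                # patch-level threshold
# slide_positive = (np.array(errors) > tau).mean() > 0.05   # assumed positive-patch fraction
```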
☆ A Systematic Analysis of Input Modalities for Fracture Classification of the Paediatric Wrist
Fractures, particularly in the distal forearm, are among the most common
injuries in children and adolescents, with approximately 800 000 cases treated
annually in Germany. The AO/OTA system provides a structured fracture type
classification, which serves as the foundation for treatment decisions.
Although accurately classifying fractures can be challenging, current deep
learning models have demonstrated performance comparable to that of experienced
radiologists. While most existing approaches rely solely on radiographs, the
potential impact of incorporating additional modalities, such as
automatic bone segmentation, fracture location, and radiology reports, remains
underexplored. In this work, we systematically analyse the contribution of
these three additional information types, finding that combining them with
radiographs increases the AUROC from 91.71 to 93.25. Our code is available on
GitHub.
comment: Code available on
https://github.com/multimodallearning/AO_Classification
☆ MobiFuse: A High-Precision On-device Depth Perception System with Multi-Data Fusion
Jinrui Zhang, Deyu Zhang, Tingting Long, Wenxin Chen, Ju Ren, Yunxin Liu, Yudong Zhao, Yaoxue Zhang, Youngki Lee
We present MobiFuse, a high-precision depth perception system on mobile
devices that combines dual RGB and Time-of-Flight (ToF) cameras. To achieve
this, we leverage physical principles from various environmental factors to
propose the Depth Error Indication (DEI) modality, characterizing the depth
error of ToF and stereo-matching. Furthermore, we employ a progressive fusion
strategy, merging geometric features from ToF and stereo depth maps with depth
error features from the DEI modality to create precise depth maps.
Additionally, we create a new ToF-Stereo depth dataset, RealToF, to train and
validate our model. Our experiments demonstrate that MobiFuse outperforms
baselines, reducing depth measurement errors by up to 77.7%. It
also showcases strong generalization across diverse datasets and proves
effective in two downstream tasks: 3D reconstruction and 3D segmentation.
The demo video of MobiFuse in real-life scenarios is available at the
de-identified YouTube link(https://youtu.be/jy-Sp7T1LVs).
☆ Do Language Models Understand Time?
Large language models (LLMs) have revolutionized video-based computer vision
applications, including action recognition, anomaly detection, and video
summarization. Videos inherently pose unique challenges, combining spatial
complexity with temporal dynamics that are absent in static images or textual
data. Current approaches to video understanding with LLMs often rely on
pretrained video encoders to extract spatiotemporal features and text encoders
to capture semantic meaning. These representations are integrated within LLM
frameworks, enabling multimodal reasoning across diverse video tasks. However,
the critical question persists: Can LLMs truly understand the concept of time,
and how effectively can they reason about temporal relationships in videos?
This work critically examines the role of LLMs in video processing, with a
specific focus on their temporal reasoning capabilities. We identify key
limitations in the interaction between LLMs and pretrained encoders, revealing
gaps in their ability to model long-term dependencies and abstract temporal
concepts such as causality and event progression. Furthermore, we analyze
challenges posed by existing video datasets, including biases, lack of temporal
annotations, and domain-specific limitations that constrain the temporal
understanding of LLMs. To address these gaps, we explore promising future
directions, including the co-evolution of LLMs and encoders, the development of
enriched datasets with explicit temporal labels, and innovative architectures
for integrating spatial, temporal, and semantic reasoning. By addressing these
challenges, we aim to advance the temporal comprehension of LLMs, unlocking
their full potential in video analysis and beyond.
comment: Research report
☆ Prompt Categories Cluster for Weakly Supervised Semantic Segmentation
Weakly Supervised Semantic Segmentation (WSSS), which leverages image-level
labels, has garnered significant attention due to its cost-effectiveness. The
previous methods mainly strengthen inter-class differences to avoid class
semantic ambiguity, which may lead to erroneous activation. However, they
overlook the positive role of information shared between similar
classes. Categories within the same cluster share some similar features.
Allowing the model to recognize these features can further relieve the semantic
ambiguity between these classes. To effectively identify and utilize this
shared information, in this paper, we introduce a novel WSSS framework called
Prompt Categories Clustering (PCC). Specifically, we explore the ability of
Large Language Models (LLMs) to derive category clusters through prompts. These
clusters effectively represent the intrinsic relationships between categories.
By integrating this relational information into the training network, our model
is able to better learn the hidden connections between categories. Experimental
results demonstrate the effectiveness of our approach, showing its ability to
enhance performance on the PASCAL VOC 2012 dataset and surpass existing
state-of-the-art methods in WSSS.
☆ Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection
Recent studies have shown that large vision-language models (LVLMs) often
suffer from the issue of object hallucinations (OH). To mitigate this issue, we
introduce an efficient method that edits the model weights based on an unsafe
subspace, which we call HalluSpace in this paper. With truthful and
hallucinated text prompts accompanying the visual content as inputs, the
HalluSpace can be identified by extracting the hallucinated embedding features
and removing the truthful representations in LVLMs. By orthogonalizing the
model weights with respect to the HalluSpace, input features are projected into
its null space to reduce OH, which gives our method its name, Nullu. We reveal
that HalluSpaces generally contain statistical bias and unimodal priors of the
large language models (LLMs) applied to build LVLMs, which have been shown as
essential causes of OH in previous studies. Therefore, null space projection
suppresses the LLMs' priors to filter out the hallucinated features, resulting
in contextually accurate outputs. Experiments show that our method can
effectively mitigate OH across different LVLM families without extra inference
costs and also shows strong performance on general LVLM benchmarks. Code is
released at https://github.com/Ziwei-Zheng/Nullu.
comment: Under review
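A simplified, hypothetical illustration of null-space weight editing in the spirit of the approach above: an "unsafe" subspace is estimated by SVD from paired hallucinated/truthful features, and a weight matrix is projected onto its orthogonal complement. Shapes, the rank, and the random data are placeholders; consult the released code for the actual procedure.

```python
import torch

def estimate_halluspace(hallu_feats, truth_feats, rank=8):
    """Top singular directions of (hallucinated - truthful) feature differences."""
    diff = hallu_feats - truth_feats                    # (n_samples, d)
    _, _, Vt = torch.linalg.svd(diff, full_matrices=False)
    return Vt[:rank]                                    # (rank, d), rows are orthonormal

def project_to_nullspace(W, basis):
    """Project a (out_dim, d) weight matrix so it ignores the estimated subspace."""
    P = torch.eye(W.shape[1]) - basis.T @ basis         # projector onto the complement
    return W @ P

d = 64
W = torch.randn(128, d)                                 # stand-in for an LLM MLP weight
hallu = torch.randn(256, d)                             # fake paired features
truth = torch.randn(256, d)
W_edited = project_to_nullspace(W, estimate_halluspace(hallu, truth))
```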
☆ Object Style Diffusion for Generalized Object Detection in Urban Scene
Object detection is a critical task in computer vision, with applications in
various domains such as autonomous driving and urban scene monitoring. However,
deep learning-based approaches often demand large volumes of annotated data,
which are costly and difficult to acquire, particularly in complex and
unpredictable real-world environments. This dependency significantly hampers
the generalization capability of existing object detection techniques. To
address this issue, we introduce a novel single-domain object detection
generalization method, named GoDiff, which leverages a pre-trained model to
enhance generalization in unseen domains. Central to our approach is the Pseudo
Target Data Generation (PTDG) module, which employs a latent diffusion model to
generate pseudo-target domain data that preserves source domain characteristics
while introducing stylistic variations. By integrating this pseudo data with
source domain data, we diversify the training dataset. Furthermore, we
introduce a cross-style instance normalization technique to blend style
features from different domains generated by the PTDG module, thereby
increasing the detector's robustness. Experimental results demonstrate that our
method not only enhances the generalization ability of existing detectors but
also functions as a plug-and-play enhancement for other single-domain
generalization methods, achieving state-of-the-art performance in autonomous
driving scenarios.
☆ Spatial Brain Tumor Concentration Estimation for Individualized Radiotherapy Planning
Jonas Weidner, Michal Balcerak, Ivan Ezhov, André Datchev, Laurin Lux, Lucas Zimmer, Daniel Rueckert, Björn Menze, Benedikt Wiestler
Biophysical modeling of brain tumors has emerged as a promising strategy for
personalizing radiotherapy planning by estimating the otherwise hidden
distribution of tumor cells within the brain. However, many existing
state-of-the-art methods are computationally intensive, limiting their
widespread translation into clinical practice. In this work, we propose an
efficient and direct method that utilizes soft physical constraints to estimate
the tumor cell concentration from preoperative MRI of brain tumor patients. Our
approach optimizes a 3D tumor concentration field by jointly minimizing the
mismatch with the observed MRI and a physics-informed loss
term. Compared to existing state-of-the-art techniques, our method
significantly improves predicting tumor recurrence on two public datasets with
a total of 192 patients while maintaining a clinically viable runtime of under
one minute - a substantial reduction from the 30 minutes required by the
current best approach. Furthermore, we showcase the generalizability of our
framework by incorporating additional imaging information and physical
constraints, highlighting its potential to translate to various medical
diffusion phenomena with imperfect data.
☆ CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers?
Dimitrios Mallis, Ahmet Serdar Karadeniz, Sebastian Cavada, Danila Rukhovich, Niki Foteinopoulou, Kseniya Cherenkova, Anis Kacem, Djamila Aouada
We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design.
Our approach is based on a powerful Vision and Large Language Model (VLLM) as a
planner and a tool-augmentation paradigm using CAD-specific modules.
CAD-Assistant addresses multimodal user queries by generating actions that are
iteratively executed on a Python interpreter equipped with the FreeCAD
software, accessed via its Python API. Our framework is able to assess the
impact of generated CAD commands on geometry and adapts subsequent actions
based on the evolving state of the CAD design. We consider a wide range of
CAD-specific tools including Python libraries, modules of the FreeCAD Python
API, helpful routines, rendering functions and other specialized modules. We
evaluate our method on multiple CAD benchmarks and qualitatively demonstrate
the potential of tool-augmented VLLMs as generic CAD task solvers across
diverse CAD workflows.
☆ M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
Intelligent robots need to interact with diverse objects across various
environments. The appearance and state of objects frequently undergo complex
transformations depending on the object properties, e.g., phase transitions.
However, in the vision community, segmenting dynamic objects with phase
transitions is overlooked. In light of this, we introduce the concept of phase
in segmentation, which categorizes real-world objects based on their visual
characteristics and potential morphological and appearance changes. Then, we
present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video
Object Segmentation (M3-VOS), to verify the ability of models to understand
object phases, which consists of 479 high-resolution videos spanning over 10
distinct everyday scenarios. It provides dense instance mask annotations that
capture both object phases and their transitions. We evaluate state-of-the-art
methods on M3-VOS, yielding several key insights. Notably, current
appearance-based approaches show significant room for improvement when handling objects
with phase transitions. The inherent changes in disorder suggest that the
predictive performance of the forward entropy-increasing process can be
improved through a reverse entropy-reducing process. These findings lead us to
propose ReVOS, a new plug-and-play model that improves its performance by
reversal refinement. Our data and code will be publicly available.
comment: 18 pages, 12 figures
☆ An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training
Haiming Zhang, Ying Xue, Xu Yan, Jiacheng Zhang, Weichao Qiu, Dongfeng Bai, Bingbing Liu, Shuguang Cui, Zhen Li
The field of autonomous driving is experiencing a surge of interest in world
models, which aim to predict potential future scenarios based on historical
observations. In this paper, we introduce DFIT-OccWorld, an efficient 3D
occupancy world model that leverages decoupled dynamic flow and image-assisted
training strategy, substantially improving 4D scene forecasting performance. To
simplify the training process, we discard the previous two-stage training
strategy and innovatively reformulate the occupancy forecasting problem as a
decoupled voxel warping process. Our model forecasts future dynamic voxels by
warping existing observations using voxel flow, whereas static voxels are
easily obtained through pose transformation. Moreover, our method incorporates
an image-assisted training paradigm to enhance prediction reliability.
Specifically, differentiable volume rendering is adopted to generate rendered
depth maps from predicted future volumes, which are then used in a
rendering-based photometric consistency loss. Experiments demonstrate the
effectiveness of our approach, showcasing its state-of-the-art performance on
the nuScenes and OpenScene benchmarks for 4D occupancy forecasting, end-to-end
motion planning, and point cloud forecasting, while incurring substantially
lower computational costs than existing 3D world models.
☆ Mesoscopic Insights: Orchestrating Multi-scale & Hybrid Architecture for Image Manipulation Localization
Xuekang Zhu, Xiaochen Ma, Lei Su, Zhuohang Jiang, Bo Du, Xiwen Wang, Zeyu Lei, Wentao Feng, Chi-Man Pun, Jizhe Zhou
The mesoscopic level serves as a bridge between the macroscopic and
microscopic worlds, addressing gaps overlooked by both. Image manipulation
localization (IML), a crucial technique to pursue truth from fake images, has
long relied on low-level (microscopic-level) traces. However, in practice, most
tampering aims to deceive the audience by altering image semantics. As a
result, manipulation commonly occurs at the object level (macroscopic level),
which is equally important as microscopic traces. Therefore, integrating these
two levels into the mesoscopic level presents a new perspective for IML
research. Inspired by this, our paper explores how to simultaneously construct
mesoscopic representations of micro and macro information for IML and
introduces the Mesorch architecture to orchestrate both. Specifically, this
architecture i) combines Transformers and CNNs in parallel, with Transformers
extracting macro information and CNNs capturing micro details, and ii) explores
across different scales, assessing micro and macro information seamlessly.
Additionally, based on the Mesorch architecture, the paper introduces two
baseline models aimed at solving IML tasks through mesoscopic representation.
Extensive experiments across four datasets have demonstrated that our models
surpass the current state-of-the-art in terms of performance, computational
complexity, and robustness.
comment: AAAI 2025. Code: https://github.com/scu-zjz/Mesorch
☆ Multi-Exposure Image Fusion via Distilled 3D LUT Grid with Editable Mode
With the rising imaging resolution of handheld devices, existing
multi-exposure image fusion algorithms struggle to generate a high dynamic
range image with ultra-high resolution in real time. In addition, there is
a trend toward designing manageable and editable algorithms to meet the varying
needs of real application scenarios. To tackle these issues, we introduce 3D LUT
technology, which can enhance images with ultra-high-definition (UHD)
resolution in real time on resource-constrained devices. However, fusing
information from multiple images with different exposure levels is inherently
uncertain, and this uncertainty significantly challenges the generalization power
of the 3D LUT grid. To address this issue and ensure a robust learning space
for the model, we propose using a teacher-student network to model the
uncertainty on the 3D LUT grid. Furthermore, we provide an editable mode for the
multi-exposure image fusion algorithm by using the implicit representation
function to match the requirements in different scenarios. Extensive
experiments demonstrate that our proposed method is highly competitive in
efficiency and accuracy.
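For reference, applying a 3D LUT with trilinear interpolation can be sketched in a few lines of PyTorch using grid_sample; the identity LUT and the random image below are placeholders for a distilled grid and a real multi-exposure input, and the grid size is an assumption.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(img, lut):
    """img: (B, 3, H, W) RGB in [0, 1]; lut: (3, D, D, D) grid mapping RGB -> RGB."""
    B, _, H, W = img.shape
    # grid_sample expects (x, y, z) coords in [-1, 1]; here they index the (B, G, R) axes.
    coords = img.permute(0, 2, 3, 1)[..., [2, 1, 0]] * 2 - 1
    grid = coords.reshape(B, 1, H * W, 1, 3)
    lut = lut.unsqueeze(0).expand(B, -1, -1, -1, -1)
    out = F.grid_sample(lut, grid, mode="bilinear", align_corners=True)  # trilinear lookup
    return out.reshape(B, 3, H, W)

D = 17
r = torch.linspace(0, 1, D)
identity_lut = torch.stack(torch.meshgrid(r, r, r, indexing="ij"), dim=0)  # placeholder LUT
out = apply_3d_lut(torch.rand(1, 3, 256, 256), identity_lut)               # UHD works the same way
```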
☆ Learnable Prompting SAM-induced Knowledge Distillation for Semi-supervised Medical Image Segmentation
The limited availability of labeled data has driven advancements in
semi-supervised learning for medical image segmentation. Modern large-scale
models tailored for general segmentation, such as the Segment Anything Model
(SAM), have revealed robust generalization capabilities. However, applying
these models directly to medical image segmentation still exposes performance
degradation. In this paper, we propose a learnable prompting SAM-induced
Knowledge distillation framework (KnowSAM) for semi-supervised medical image
segmentation. Firstly, we propose a Multi-view Co-training (MC) strategy that
employs two distinct sub-networks in a co-teaching paradigm, resulting
in more robust outcomes. Secondly, we present a Learnable Prompt Strategy (LPS)
to dynamically produce dense prompts and integrate an adapter to fine-tune SAM
specifically for medical image segmentation tasks. Moreover, we propose
SAM-induced Knowledge Distillation (SKD) to transfer useful knowledge from SAM
to two sub-networks, enabling them to learn from SAM's predictions and
alleviate the effects of incorrect pseudo-labels during training. Notably, the
predictions generated by our subnets are used to produce mask prompts for SAM,
facilitating effective inter-module information exchange. Extensive
experimental results on various medical segmentation tasks demonstrate that our
model outperforms the state-of-the-art semi-supervised segmentation approaches.
Crucially, our SAM distillation framework can be seamlessly integrated into
other semi-supervised segmentation methods to enhance performance. The code
will be released upon acceptance of this manuscript at:
https://github.com/taozh2017/KnowSAM
comment: 12 pages, 7 figures
☆ MedCoT: Medical Chain of Thought via Hierarchical Expert
Artificial intelligence has advanced in Medical Visual Question Answering
(Med-VQA), but prevalent research tends to focus on the accuracy of the
answers, often overlooking the reasoning paths and interpretability, which are
crucial in clinical settings. Besides, current Med-VQA algorithms, typically
reliant on singular models, lack the robustness needed for real-world medical
diagnostics which usually require collaborative expert evaluation. To address
these shortcomings, this paper presents MedCoT, a novel hierarchical expert
verification reasoning chain method designed to enhance interpretability and
accuracy in biomedical imaging inquiries. MedCoT is predicated on two
principles: The necessity for explicit reasoning paths in Med-VQA and the
requirement for multi-expert review to formulate accurate conclusions. The
methodology involves an Initial Specialist proposing diagnostic rationales,
followed by a Follow-up Specialist who validates these rationales, and finally,
a consensus is reached through a vote among a sparse Mixture of Experts within
the locally deployed Diagnostic Specialist, which then provides the definitive
diagnosis. Experimental evaluations on four standard Med-VQA datasets
demonstrate that MedCoT surpasses existing state-of-the-art approaches,
providing significant improvements in performance and interpretability.
☆ 3D Registration in 30 Years: A Survey
Jiaqi Yang, Chu'ai Zhang, Zhengbao Wang, Xinyue Cao, Xuan Ouyang, Xiyu Zhang, Zhenxuan Zeng, Zhao Zeng, Borui Lu, Zhiyi Xia, Qian Zhang, Yulan Guo, Yanning Zhang
3D point cloud registration is a fundamental problem in computer vision,
computer graphics, robotics, and remote sensing. Over the last thirty
years, the area has seen remarkable advances, with numerous kinds of solutions
proposed. Although a handful of relevant surveys have been conducted,
their coverage is still limited. In this work, we present a comprehensive
survey on 3D point cloud registration, covering a set of sub-areas such as
pairwise coarse registration, pairwise fine registration, multi-view
registration, cross-scale registration, and multi-instance registration. The
datasets, evaluation metrics, method taxonomy, discussions of the merits and
demerits, and thoughts on future directions are comprehensively
presented in this survey. The regularly updated project page of the survey is
available at https://github.com/Amyyyy11/3D-Registration-in-30-Years-A-Survey.
☆ Text2Relight: Creative Portrait Relighting with Text Guidance
Junuk Cha, Mengwei Ren, Krishna Kumar Singh, He Zhang, Yannick Hold-Geoffroy, Seunghyun Yoon, HyunJoon Jung, Jae Shin Yoon, Seungryul Baek
We present a lighting-aware image editing pipeline that, given a portrait
image and a text prompt, performs single image relighting. Our model modifies
the lighting and color of both the foreground and background to align with the
provided text description. The unbounded creativity of text
allows us to describe the lighting of a scene with any sensory features,
including temperature, emotion, smell, time, and so on. However, modeling
such a mapping between unbounded text and lighting is extremely
challenging due to the lack of data: no scalable dataset exists that
provides large-scale pairs of text and relighting, and therefore current
text-driven image editing models do not generalize to lighting-specific use cases. We
overcome this problem by introducing a novel data synthesis pipeline: First,
diverse and creative text prompts that describe the scenes with various
lighting are automatically generated under a crafted hierarchy using a large
language model (e.g., ChatGPT). A text-guided image generation model creates
a lighting image that best matches the text. Conditioned on these lighting
images, we perform image-based relighting for both foreground and background
using a single portrait image or a set of OLAT (One-Light-at-A-Time) images
captured with a light stage system. Particularly for the background relighting, we
represent the lighting image as a set of point lights and transfer them to
other background images. A generative diffusion model learns the synthesized
large-scale data with auxiliary task augmentation (e.g., portrait delighting
and light positioning) to correlate the latent text and lighting distribution
for text-guided portrait relighting.
☆ Modelling Multi-modal Cross-interaction for ML-FSIC Based on Local Feature Selection
The aim of multi-label few-shot image classification (ML-FSIC) is to assign
semantic labels to images, in settings where only a small number of training
examples are available for each label. A key feature of the multi-label setting
is that images often have several labels, which typically refer to objects
appearing in different regions of the image. When estimating label prototypes,
in a metric-based setting, it is thus important to determine which regions are
relevant for which labels, but the limited amount of training data and the
noisy nature of local features make this highly challenging. As a solution, we
propose a strategy in which label prototypes are gradually refined. First, we
initialize the prototypes using word embeddings, which allows us to leverage
prior knowledge about the meaning of the labels. Second, taking advantage of
these initial prototypes, we then use a Loss Change Measurement (LCM) strategy
to select the local features from the training images (i.e., the support set)
that are most likely to be representative of a given label. Third, we construct
the final prototype of the label by aggregating these representative local
features using a multi-modal cross-interaction mechanism, which again relies on
the initial word embedding-based prototypes. Experiments on COCO, PASCAL VOC,
NUS-WIDE, and iMaterialist show that our model substantially improves the
current state-of-the-art.
comment: Accepted in Transactions on Multimedia Computing Communications and
Applications
☆ Unified Understanding of Environment, Task, and Human for Human-Robot Interaction in Real-World Environments
To facilitate human-robot interaction (HRI) tasks in real-world scenarios,
service robots must adapt to dynamic environments and understand the required
tasks while effectively communicating with humans. To accomplish HRI in
practice, we propose a novel indoor dynamic map, task understanding system, and
response generation system. The indoor dynamic map optimizes robot behavior by
managing an occupancy grid map and dynamic information, such as furniture and
humans, in separate layers. The task understanding system targets tasks that
require multiple actions, such as serving ordered items. Task representations
that predefine the flow of necessary actions are applied to achieve highly
accurate understanding. The response generation system is executed in parallel
with task understanding to facilitate smooth HRI by informing humans of the
subsequent actions of the robot. In this study, we focused on waiter duties in
a restaurant setting as a representative application of HRI in a dynamic
environment. We developed an HRI system that could perform tasks such as
serving food and cleaning up while communicating with customers. In experiments
conducted in a simulated restaurant environment, the proposed HRI system
successfully communicated with customers and served ordered food with 90%
accuracy. In a questionnaire administered after the experiment, the HRI system
of the robot received 4.2 points out of 5. These outcomes indicated the
effectiveness of the proposed method and HRI system in executing waiter tasks
in real-world environments.
comment: 2024 33rd IEEE International Conference on Robot and Human
Interactive Communication (RO-MAN)
☆ Towards Automatic Evaluation for Image Transcreation
Beyond conventional paradigms of translating speech and text, recently, there
has been interest in automated transcreation of images to facilitate
localization of visual content across different cultures. Attempts to define
this as a formal Machine Learning (ML) problem have been impeded by the lack of
automatic evaluation mechanisms, with previous work relying solely on human
evaluation. In this paper, we seek to close this gap by proposing a suite of
automatic evaluation metrics inspired by machine translation (MT) metrics,
categorized into: a) Object-based, b) Embedding-based, and c) VLM-based.
Drawing on theories from translation studies and real-world transcreation
practices, we identify three critical dimensions of image transcreation:
cultural relevance, semantic equivalence and visual similarity, and design our
metrics to evaluate systems along these axes. Our results show that proprietary
VLMs best identify cultural relevance and semantic equivalence, while
vision-encoder representations are adept at measuring visual similarity.
Meta-evaluation across 7 countries shows our metrics agree strongly with human
ratings, with average segment-level correlations ranging from 0.55 to 0.87.
Finally, through a discussion of the merits and demerits of each metric, we
offer a robust framework for automated image transcreation evaluation, grounded
in both theoretical foundations and practical application. Our code can be
found here: https://github.com/simran-khanuja/automatic-eval-transcreation
☆ Physics-Based Adversarial Attack on Near-Infrared Human Detector for Nighttime Surveillance Camera Systems
Many surveillance cameras switch between daytime and nighttime modes based on
illuminance levels. During the day, the camera records ordinary RGB images
through an enabled IR-cut filter. At night, the filter is disabled to capture
near-infrared (NIR) light emitted from NIR LEDs typically mounted around the
lens. While RGB-based AI algorithm vulnerabilities have been widely reported,
the vulnerabilities of NIR-based AI have rarely been investigated. In this
paper, we identify fundamental vulnerabilities in NIR-based image understanding
caused by color and texture loss due to the intrinsic characteristics of
clothes' reflectance and cameras' spectral sensitivity in the NIR range. We
further show that the nearly co-located configuration of illuminants and
cameras in existing surveillance systems facilitates concealing and fully
passive attacks in the physical world. Specifically, we demonstrate how
retro-reflective and insulation plastic tapes can manipulate the intensity
distribution of NIR images. We showcase an attack on the YOLO-based human
detector using binary patterns designed in the digital space (via black-box
query and searching) and then physically realized using tapes pasted onto
clothes. Our attack highlights significant reliability concerns for nighttime
surveillance systems, which are intended to enhance security. Codes Available:
https://github.com/MyNiuuu/AdvNIR
comment: Appeared in ACM MM 2023
☆ JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts
Video Action Detection (VAD) involves localizing and categorizing action
instances in videos. Videos inherently contain various information sources,
including audio, visual cues, and surrounding scene contexts. Effectively
leveraging this multi-modal information for VAD is challenging, as the model
must accurately focus on action-relevant cues. In this study, we introduce a
novel multi-modal VAD architecture called the Joint Actor-centric Visual,
Audio, Language Encoder (JoVALE). JoVALE is the first VAD method to integrate
audio and visual features with scene descriptive context derived from large
image captioning models. The core principle of JoVALE is the actor-centric
aggregation of audio, visual, and scene descriptive contexts, where
action-related cues from each modality are identified and adaptively combined.
We propose a specialized module called the Actor-centric Multi-modal Fusion
Network, designed to capture the joint interactions among actors and
multi-modal contexts through Transformer architecture. Our evaluation conducted
on three popular VAD benchmarks, AVA, UCF101-24, and JHMDB51-21, demonstrates
that incorporating multi-modal information leads to significant performance
gains. JoVALE achieves state-of-the-art performance. The code will be
available at https://github.com/taeiin/AAAI2025-JoVALE.
comment: Accepted to AAAI Conference on Artificial Intelligence 2025, 9 pages,
5 figures
☆ Mitigating Adversarial Attacks in LLMs through Defensive Suffix Generation
Minkyoung Kim, Yunha Kim, Hyeram Seo, Heejung Choi, Jiye Han, Gaeun Kee, Soyoung Ko, HyoJe Jung, Byeolhee Kim, Young-Hak Kim, Sanghyun Park, Tae Joon Jun
Large language models (LLMs) have exhibited outstanding performance in
natural language processing tasks. However, these models remain susceptible to
adversarial attacks in which slight input perturbations can lead to harmful or
misleading outputs. A gradient-based defensive suffix generation algorithm is
designed to bolster the robustness of LLMs. By appending carefully optimized
defensive suffixes to input prompts, the algorithm mitigates adversarial
influences while preserving the models' utility. To enhance adversarial
understanding, a novel total loss function ($L_{\text{total}}$) combining
defensive loss ($L_{\text{def}}$) and adversarial loss ($L_{\text{adv}}$)
generates defensive suffixes more effectively. Experimental evaluations
conducted on open-source LLMs such as Gemma-7B, Mistral-7B, Llama2-7B, and
Llama2-13B show that the proposed method reduces attack success rates (ASR) by
an average of 11% compared to models without defensive suffixes. Additionally,
the perplexity score of Gemma-7B decreased from 6.57 to 3.93 when applying the
defensive suffix generated by OpenELM-270M. Furthermore, TruthfulQA evaluations
demonstrate consistent improvements, with truthfulness scores increasing by up
to 10% across tested configurations. This approach significantly enhances the
security of LLMs in critical applications without requiring extensive
retraining.
comment: 9 pages, 2 figures
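A toy sketch of the loss combination described above, under strong assumptions: the suffix is kept as a continuous ("soft") embedding, the model is a stand-in embedding plus linear head rather than a real LLM, and the weighting and sign of the adversarial term are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d_model, suffix_len = 100, 32, 5

embed = torch.nn.Embedding(vocab, d_model)               # toy stand-ins for a real LLM
lm_head = torch.nn.Linear(d_model, vocab)
suffix = torch.randn(suffix_len, d_model, requires_grad=True)   # soft defensive suffix

prompt_ids = torch.randint(0, vocab, (8,))
safe_target = torch.randint(0, vocab, (suffix_len,))     # tokens of a safe continuation
harmful_target = torch.randint(0, vocab, (suffix_len,))  # tokens an attacker would want

opt = torch.optim.Adam([suffix], lr=1e-2)
for _ in range(100):
    h = torch.cat([embed(prompt_ids), suffix], dim=0)    # prompt embeddings + soft suffix
    logits = lm_head(h[-suffix_len:])
    l_def = F.cross_entropy(logits, safe_target)         # L_def: favor the safe reply
    l_adv = -F.cross_entropy(logits, harmful_target)     # L_adv: discourage the harmful one
    loss = l_def + 0.1 * l_adv                           # assumed weighting of L_total
    opt.zero_grad(); loss.backward(); opt.step()
```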
☆ MBInception: A new Multi-Block Inception Model for Enhancing Image Processing Efficiency
Deep learning models, specifically convolutional neural networks, have
transformed the landscape of image classification by autonomously extracting
features directly from raw pixel data. This article introduces an innovative
image classification model that employs three consecutive inception blocks
within a convolutional neural networks framework, providing a comprehensive
comparative analysis with well-established architectures such as Visual
Geometry Group, Residual Network, and MobileNet. Through the utilization of
benchmark datasets, including the Canadian Institute for Advanced Research, Modified
National Institute of Standards and Technology database, and Fashion Modified
National Institute of Standards and Technology database, we assess the
performance of our proposed model in comparison to these benchmarks. The
outcomes reveal that our novel model consistently outperforms its counterparts
across diverse datasets, underscoring its effectiveness and potential for
advancing the current state-of-the-art in image classification. Evaluation
metrics further emphasize that the proposed model surpasses the other compared
architectures, thereby enhancing the efficiency of image classification on
standard datasets.
comment: 26 pages, 10 figures
☆ Optical aberrations in autonomous driving: Physics-informed parameterized temperature scaling for neural network uncertainty calibration
'A trustworthy representation of uncertainty is desirable and should be
considered as a key feature of any machine learning method' (Huellermeier and
Waegeman, 2021). This conclusion of Huellermeier et al. underscores the
importance of calibrated uncertainties. Since AI-based algorithms are heavily
impacted by dataset shifts, the automotive industry needs to safeguard its
system against all possible contingencies. One important but often neglected
dataset shift is caused by optical aberrations induced by the windshield. For
the verification of the perception system performance, requirements on the AI
performance need to be translated into optical metrics by a bijective mapping
(Braun, 2023). Given this bijective mapping it is evident that the optical
system characteristics add additional information about the magnitude of the
dataset shift. As a consequence, we propose to incorporate a physical inductive
bias into the neural network calibration architecture to enhance the robustness
and the trustworthiness of the AI target application, which we demonstrate by
using a semantic segmentation task as an example. By utilizing the Zernike
coefficient vector of the optical system as a physical prior we can
significantly reduce the mean expected calibration error in case of optical
aberrations. As a result, we pave the way for a trustworthy uncertainty
representation and for a holistic verification strategy of the perception
chain.
comment: Under review at the International Journal of Computer Vision (IJCV)
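A rough sketch of how a physical prior could enter a parameterized temperature-scaling head: a small MLP maps the Zernike coefficient vector to a positive per-image temperature that rescales frozen segmentation logits, trained by minimizing NLL on a calibration split. Dimensions, the network size, and the random data are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

class ZernikeTemperature(torch.nn.Module):
    """Map Zernike aberration coefficients to a positive, per-image temperature."""
    def __init__(self, n_zernike=15):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_zernike, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

    def forward(self, logits, zernike):
        t = F.softplus(self.net(zernike)) + 1e-3        # temperature > 0
        return logits / t.view(-1, 1, 1, 1)             # rescale frozen segmentation logits

calib = ZernikeTemperature()
opt = torch.optim.Adam(calib.parameters(), lr=1e-3)
logits = torch.randn(4, 19, 64, 64)                     # placeholder frozen logits
zernike = torch.randn(4, 15)                            # placeholder per-image coefficients
labels = torch.randint(0, 19, (4, 64, 64))
for _ in range(50):                                     # minimize NLL on a calibration split
    loss = F.cross_entropy(calib(logits, zernike), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```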
☆ MMO-IG: Multi-Class and Multi-Scale Object Image Generation for Remote Sensing
The rapid advancement of deep generative models (DGMs) has significantly
advanced research in computer vision, providing a cost-effective alternative to
acquiring vast quantities of expensive imagery. However, existing methods
predominantly focus on synthesizing remote sensing (RS) images aligned with
real images in a global layout view, which limits their applicability in RS
image object detection (RSIOD) research. To address these challenges, we
propose a multi-class and multi-scale object image generator based on DGMs,
termed MMO-IG, designed to generate RS images with supervised object labels
from global and local aspects simultaneously. Specifically, from the local
view, MMO-IG encodes various RS instances using an iso-spacing instance map
(ISIM). During the generation process, it decodes each instance region with an
iso-spacing value in the ISIM (corresponding to both background and foreground
instances) to produce RS images through the denoising process of diffusion
models. Considering the complex interdependencies among MMOs, we construct a
spatial-cross dependency knowledge graph (SCDKG). This ensures a realistic and
reliable multidirectional distribution among MMOs for region embedding, thereby
reducing the discrepancy between source and target domains. Besides, we propose
a structured object distribution instruction (SODI) to guide the generation of
synthesized RS image content from a global aspect with SCDKG-based ISIM
together. Extensive experimental results demonstrate that our MMO-IG exhibits
superior generation capabilities for RS images with dense MMO-supervised
labels, and RS detectors pre-trained with MMO-IG show excellent performance on
real-world datasets.
☆ When Should We Prefer State-to-Visual DAgger Over Visual Reinforcement Learning?
Learning policies from high-dimensional visual inputs, such as pixels and
point clouds, is crucial in various applications. Visual reinforcement learning
is a promising approach that directly trains policies from visual observations,
although it faces challenges in sample efficiency and computational costs. This
study conducts an empirical comparison of State-to-Visual DAgger, a two-stage
framework that initially trains a state policy before adopting online imitation
to learn a visual policy, and Visual RL across a diverse set of tasks. We
evaluate both methods across 16 tasks from three benchmarks, focusing on their
asymptotic performance, sample efficiency, and computational costs.
Surprisingly, our findings reveal that State-to-Visual DAgger does not
universally outperform Visual RL but shows significant advantages in
challenging tasks, offering more consistent performance. In contrast, its
benefits in sample efficiency are less pronounced, although it often reduces
the overall wall-clock time required for training. Based on our findings, we
provide recommendations for practitioners and hope that our results contribute
valuable perspectives for future research in visual policy learning.
comment: Accepted by The 39th Annual AAAI Conference on Artificial
Intelligence (AAAI 2025)
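A schematic sketch of the two-stage State-to-Visual DAgger loop compared above: a privileged state-based expert relabels the on-policy visual observations collected by the student, which is then retrained on the aggregated data. The environment interface, policy objects, and hyperparameters are placeholders, not the benchmark's actual API.

```python
def state_to_visual_dagger(env, state_expert, visual_student, n_iters=10, horizon=200):
    """Hypothetical training loop: the student acts, the privileged expert relabels."""
    dataset = []                                          # (pixel observation, expert action)
    for _ in range(n_iters):
        obs = env.reset()                                 # assumed dict with 'state' and 'pixels'
        for _ in range(horizon):
            a_student = visual_student.act(obs["pixels"])
            a_expert = state_expert.act(obs["state"])     # relabel with the state-based expert
            dataset.append((obs["pixels"], a_expert))
            obs, _, done, _ = env.step(a_student)         # roll out the student's own action
            if done:
                break
        visual_student.fit(dataset)                       # supervised update on aggregated data
    return visual_student
```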
☆ GLCF: A Global-Local Multimodal Coherence Analysis Framework for Talking Face Generation Detection
Talking face generation (TFG) allows for producing lifelike talking videos of
any character using only facial images and accompanying text. Abuse of this
technology could pose significant risks to society, creating the urgent need
for research into corresponding detection methods. However, research in this
field has been hindered by the lack of public datasets. In this paper, we
construct the first large-scale multi-scenario talking face dataset (MSTF),
which contains 22 audio and video forgery techniques, filling the gap of
datasets in this field. The dataset covers 11 generation scenarios and more
than 20 semantic scenarios, bringing it closer to the practical application
scenarios of TFG. Besides, we also propose a TFG detection framework, which leverages the
analysis of both global and local coherence in the multimodal content of TFG
videos. Therefore, a region-focused smoothness detection module (RSFDM) and a
discrepancy capture-time frame aggregation module (DCTAM) are introduced to
evaluate the global temporal coherence of TFG videos, aggregating multi-grained
spatial information. Additionally, a visual-audio fusion module (V-AFM) is
designed to evaluate audiovisual coherence within a localized temporal
perspective. Comprehensive experiments demonstrate the soundness and
difficulty of our dataset, while also indicating the superiority of our
proposed method compared to the state-of-the-art deepfake detection approaches.
☆ VIIS: Visible and Infrared Information Synthesis for Severe Low-light Image Enhancement
Images captured in severe low-light circumstances often suffer from
significant information absence. Existing singular modality image enhancement
methods struggle to restore image regions lacking valid information. By
leveraging light-impervious infrared images, visible and infrared image fusion
methods have the potential to reveal information hidden in darkness. However,
they primarily emphasize inter-modal complementation but neglect intra-modal
enhancement, limiting the perceptual quality of output images. To address these
limitations, we propose a novel task, dubbed visible and infrared information
synthesis (VIIS), which aims to achieve both information enhancement and fusion
of the two modalities. Given the difficulty in obtaining ground truth in the
VIIS task, we design an information synthesis pretext task (ISPT) based on
image augmentation. We employ a diffusion model as the framework and design a
sparse attention-based dual-modalities residual (SADMR) conditioning mechanism
to enhance information interaction between the two modalities. This mechanism
enables features with prior knowledge from both modalities to adaptively and
iteratively attend to each modality's information during the denoising process.
Our extensive experiments demonstrate that our model qualitatively and
quantitatively outperforms not only the state-of-the-art methods in relevant
fields but also the newly designed baselines capable of both information
enhancement and fusion. The code is available at
https://github.com/Chenz418/VIIS.
comment: Accepted to WACV 2025
☆ GAGS: Granularity-Aware Feature Distillation for Language Gaussian Splatting
3D open-vocabulary scene understanding, which accurately perceives complex
semantic properties of objects in space, has gained significant attention in
recent years. In this paper, we propose GAGS, a framework that distills 2D CLIP
features into 3D Gaussian splatting, enabling open-vocabulary queries for
renderings on arbitrary viewpoints. The main challenge of distilling 2D
features for 3D fields lies in the multiview inconsistency of extracted 2D
features, which provides unstable supervision for the 3D feature field. GAGS
addresses this challenge with two novel strategies. First, GAGS associates the
prompt point density of SAM with the camera distances, which significantly
improves the multiview consistency of segmentation results. Second, GAGS
further decodes a granularity factor to guide the distillation process, and this
granularity factor can be learned in an unsupervised manner to select only the
multiview-consistent 2D features during distillation. Experimental
results on two datasets demonstrate significant performance and stability
improvements of GAGS in visual grounding and semantic segmentation, with an
inference speed 2$\times$ faster than baseline methods. The code and additional
results are available at https://pz0826.github.io/GAGS-Webpage/ .
comment: Project page: https://pz0826.github.io/GAGS-Webpage/
☆ RelationField: Relate Anything in Radiance Fields
Sebastian Koch, Johanna Wald, Mirco Colosi, Narunas Vaskevicius, Pedro Hermosilla, Federico Tombari, Timo Ropinski
Neural radiance fields are an emerging 3D scene representation and have recently
even been extended to learn features for scene understanding by distilling
open-vocabulary features from vision-language models. However, current methods
primarily focus on object-centric representations, supporting object
segmentation or detection, while understanding semantic relationships between
objects remains largely unexplored. To address this gap, we propose
RelationField, the first method to extract inter-object relationships directly
from neural radiance fields. RelationField represents relationships between
objects as pairs of rays within a neural radiance field, effectively extending
its formulation to include implicit relationship queries. To teach
RelationField complex, open-vocabulary relationships, relationship knowledge is
distilled from multi-modal LLMs. To evaluate RelationField, we solve
open-vocabulary 3D scene graph generation tasks and relationship-guided
instance segmentation, achieving state-of-the-art performance in both tasks.
See the project website at https://relationfield.github.io.
comment: Project page: https://relationfield.github.io
☆ G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
Evaluation metrics for visual captioning are important yet not thoroughly
explored. Traditional metrics like BLEU, METEOR, CIDEr, and ROUGE often miss
semantic depth, while trained metrics such as CLIP-Score, PAC-S, and Polos are
limited in zero-shot scenarios. Advanced Language Model-based metrics also
struggle with aligning to nuanced human preferences. To address these issues,
we introduce G-VEval, a novel metric inspired by G-Eval and powered by the new
GPT-4o. G-VEval uses chain-of-thought reasoning in large multimodal models and
supports three modes: reference-free, reference-only, and combined,
accommodating both video and image inputs. We also propose MSVD-Eval, a new
dataset for video captioning evaluation, to establish a more transparent and
consistent framework for both human experts and evaluation metrics. It is
designed to address the lack of clear criteria in existing datasets by
introducing distinct dimensions of Accuracy, Completeness, Conciseness, and
Relevance (ACCR). Extensive results show that G-VEval outperforms existing
methods in correlation with human annotations, as measured by Kendall tau-b and
Kendall tau-c. This provides a flexible solution for diverse captioning tasks
and suggests a straightforward yet effective approach for large language models
to understand video content, paving the way for advancements in automated
captioning. Codes are available at https://github.com/ztangaj/gveval
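For completeness, the meta-evaluation statistics mentioned above (Kendall tau-b and tau-c against human ratings) can be computed directly with SciPy; the score lists below are placeholders for per-caption metric outputs and human ratings.

```python
from scipy.stats import kendalltau

# Placeholder per-caption metric scores and human ratings.
metric_scores = [0.82, 0.35, 0.61, 0.90, 0.47]
human_scores = [4, 2, 3, 5, 2]

tau_b, _ = kendalltau(metric_scores, human_scores, variant="b")
tau_c, _ = kendalltau(metric_scores, human_scores, variant="c")
print(f"Kendall tau-b = {tau_b:.3f}, tau-c = {tau_c:.3f}")
```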
☆ Consistency of Compositional Generalization across Multiple Levels
Compositional generalization is the capability of a model to understand novel
compositions composed of seen concepts. There are multiple levels of novel
compositions including phrase-phrase level, phrase-word level, and word-word
level. Existing methods achieve promising compositional generalization, but the
consistency of compositional generalization across multiple levels of novel
compositions remains unexplored. Consistency here means that a model should
generalize to a phrase-phrase level novel composition and, simultaneously, to
the phrase-word/word-word level novel compositions that can be derived from it.
In this paper, we propose a meta-learning based framework for
achieving consistent compositional generalization across multiple levels. The
basic idea is to progressively learn compositions from simple to complex for
consistency. Specifically, we divide the original training set into multiple
validation sets based on compositional complexity, and introduce multiple
meta-weight-nets to generate sample weights for samples in different validation
sets. To fit the validation sets in order of increasing compositional
complexity, we optimize the parameters of each meta-weight-net independently
and sequentially in a multilevel optimization manner. We build a GQA-CCG
dataset to quantitatively evaluate the consistency. Experimental results on
visual question answering and temporal video grounding, demonstrate the
effectiveness of the proposed framework. We release GQA-CCG at
https://github.com/NeverMoreLCH/CCG.
comment: Accepted by AAAI 2025
☆ Self-control: A Better Conditional Mechanism for Masked Autoregressive Model
Autoregressive conditional image generation algorithms are capable of
generating photorealistic images that are consistent with given textual or
image conditions, and have great potential for a wide range of applications.
Nevertheless, the majority of popular autoregressive image generation methods
rely heavily on vector quantization, and the inherent discrete characteristic
of codebook presents a considerable challenge to achieving high-quality image
generation. To address this limitation, this paper introduces a novel
conditional introduction network for continuous masked autoregressive models.
The proposed self-control network serves to mitigate the negative impact of
vector quantization on the quality of the generated images, while
simultaneously enhancing the conditional control during the generation process.
In particular, the self-control network is constructed upon a continuous mask
autoregressive generative model, which incorporates multimodal conditional
information, including text and images, into a unified autoregressive sequence
in a serial manner. Through a self-attention mechanism, the network is capable
of generating images that are controllable based on specific conditions. The
self-control network discards the conventional cross-attention-based
conditional fusion mechanism and effectively unifies the conditional and
generative information within the same space, thereby facilitating more
seamless learning and fusion of multimodal features.
☆ MambaLCT: Boosting Tracking via Long-term Context State Space Model
Effectively constructing context information with long-term dependencies from
video sequences is crucial for object tracking. However, the context length
constructed by existing work is limited, only considering object information
from adjacent frames or video clips, leading to insufficient utilization of
contextual information. To address this issue, we propose MambaLCT, which
constructs and utilizes target variation cues from the first frame to the
current frame for robust tracking. First, a novel unidirectional Context Mamba
module is designed to scan frame features along the temporal dimension,
gathering target change cues throughout the entire sequence. Specifically,
target-related information in frame features is compressed into a hidden state
space through a selective scanning mechanism. The target information across the
entire video is continuously aggregated into target variation cues. Next, we
inject the target change cues into the attention mechanism, providing temporal
information for modeling the relationship between the template and search
frames. The advantage of MambaLCT is its ability to continuously extend the
length of the context, capturing complete target change cues, which enhances
the stability and robustness of the tracker. Extensive experiments show that
long-term context information enhances the model's ability to perceive targets
in complex scenarios. MambaLCT achieves new SOTA performance on six benchmarks
while maintaining real-time running speeds.
☆ Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking
Visual Entity Linking (VEL) is a crucial task for achieving fine-grained
visual understanding, matching objects within images (visual mentions) to
entities in a knowledge base. Previous VEL tasks rely on textual inputs, but
writing queries for complex scenes can be challenging. Visual inputs like
clicks or bounding boxes offer a more convenient alternative. Therefore, we
propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses
pixel masks from visual inputs to refer to objects, supplementing reference
methods for VEL. To facilitate research on this task, we have constructed the
MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity
annotation framework. This dataset contains over 5 million annotations aligning
pixel-level regions with entity-level labels, which will advance visual
understanding toward the fine-grained level. Moreover, as pixel masks correspond to
semantic regions in an image, we enhance previous patch-interacted attention
with region-interacted attention by a visual semantic tokenization approach.
Manual evaluation results indicate that the reverse annotation framework
achieved a 94.8% annotation success rate. Experimental results show that models
trained on this dataset improved accuracy by 18 points compared to zero-shot
models. Additionally, the semantic tokenization method achieved a 5-point
accuracy improvement over the trained baseline.
comment: AAAI 2025;Dataset are released at
https://github.com/NP-NET-research/PL-VEL
☆ Robust Tracking via Mamba-based Context-aware Token Learning
How to make a good trade-off between performance and computational cost is
crucial for a tracker. However, current prominent methods typically rely on
complicated and time-consuming learning that combines temporal and appearance
information by feeding in more and more images (or features). Consequently, these
methods not only increase the model's computational cost and learning burden
but also introduce much useless and potentially interfering information. To
alleviate the above issues, we propose a simple yet robust tracker that
separates temporal information learning from appearance modeling and extracts
temporal relations from a set of representative tokens rather than several
images (or features). Specifically, we introduce one track token for each frame
to collect the target's appearance information in the backbone. Then, we design
a mamba-based Temporal Module for track tokens to be aware of context by
interacting with other track tokens within a sliding window. This module
consists of a mamba layer with autoregressive characteristic and a
cross-attention layer with strong global perception ability, ensuring
sufficient interaction for track tokens to perceive the appearance changes and
movement trends of the target. Finally, track tokens serve as a guidance to
adjust the appearance feature for the final prediction in the head. Experiments
show our method is effective and achieves competitive performance on multiple
benchmarks at a real-time speed. Code and trained models will be available at
https://github.com/GXNU-ZhongLab/TemTrack.
comment: AAAI2025
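As a rough, hypothetical illustration of the idea of separating temporal learning from appearance modeling, the sketch below keeps one track token per frame and lets tokens within a sliding window interact; a GRU stands in for the paper's Mamba layer, and all module names and dimensions are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalTokenModule(nn.Module):
    """Hypothetical sketch: context-aware interaction among per-frame track tokens.
    A GRU stands in for the autoregressive Mamba layer described in the abstract;
    cross-attention provides the global interaction within a sliding window."""
    def __init__(self, dim=256, window=8, heads=8):
        super().__init__()
        self.window = window
        self.recurrent = nn.GRU(dim, dim, batch_first=True)   # stand-in for a Mamba layer
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, track_tokens):
        # track_tokens: (B, T, C), one token per past frame, newest last
        ctx = track_tokens[:, -self.window:]             # sliding window of recent tokens
        seq, _ = self.recurrent(ctx)                     # autoregressive temporal mixing
        cur = seq[:, -1:]                                # current frame's track token
        out, _ = self.cross_attn(cur, ctx, ctx)          # global interaction over the window
        return self.norm(cur + out)                      # updated token guides the head

# toy usage: 16 cached track tokens of dimension 256
tokens = torch.randn(2, 16, 256)
guidance = TemporalTokenModule()(tokens)                 # (2, 1, 256)
```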
☆ Faster and Stronger: When ANN-SNN Conversion Meets Parallel Spiking Calculation
Spiking Neural Network (SNN), as a brain-inspired and energy-efficient
network, is currently facing the pivotal challenge of exploring a suitable and
efficient learning framework. The predominant training methodologies, namely
Spatial-Temporal Back-propagation (STBP) and ANN-SNN Conversion, are encumbered
by substantial training overhead or pronounced inference latency, which impedes
the advancement of SNNs in scaling to larger networks and navigating intricate
application domains. In this work, we propose a novel parallel conversion
learning framework, which establishes a mathematical mapping relationship
between each time-step of the parallel spiking neurons and the cumulative spike
firing rate. We theoretically validate the lossless and sorting properties of
the conversion process, and point out the optimal shifting distance for each
step. Furthermore, by integrating the above framework with the
distribution-aware error calibration technique, we can achieve efficient
conversion for more general activation functions or training-free
circumstances. Extensive experiments have confirmed the significant performance
advantages of our method for various conversion cases under ultra-low time
latency. To the best of our knowledge, this is the first work that jointly utilizes
parallel spiking calculation and ANN-SNN Conversion, providing a highly
promising approach for SNN supervised training.
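To make the conversion idea concrete, here is a hedged toy version of the generic quantize-and-clip mapping used in ANN-SNN conversion, with a shift term standing in for the per-step shifting distance mentioned above; it is an illustration of the general principle, not the paper's exact rule.

```python
import torch

def ann_to_parallel_spikes(activation, threshold=1.0, T=4, shift=0.5):
    """Hedged sketch: a ReLU activation is quantized into T parallel spike
    decisions whose cumulative firing rate approximates the original value.
    `shift` plays the role of an assumed per-step shifting distance."""
    # expected spike count over T steps (clip-and-floor quantization)
    counts = torch.clamp(torch.floor(activation / threshold * T + shift), 0, T)
    # spread the counts over T binary time-steps (the first `counts` steps fire)
    steps = torch.arange(T).view(*([1] * activation.dim()), T)
    spikes = (steps < counts.unsqueeze(-1)).float()      # (..., T) spike train
    firing_rate = spikes.mean(-1) * threshold            # reconstruction of the activation
    return spikes, firing_rate

a = torch.relu(torch.randn(3))
spikes, rate = ann_to_parallel_spikes(a)
print(a, rate)   # rate approximates a up to quantization error
```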
☆ Sign-IDD: Iconicity Disentangled Diffusion for Sign Language Production
Sign Language Production (SLP) aims to generate semantically consistent sign
videos from textual statements, where the conversion from textual glosses to
sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign
poses as discrete three-dimensional coordinates and directly fit them, which
overlooks the relative positional relationships among joints. To this end, we
provide a new perspective, constraining joint associations and gesture details
by modeling the limb bones to improve the accuracy and naturalness of the
generated poses. In this work, we propose a pioneering iconicity disentangled
diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD
incorporates a novel Iconicity Disentanglement (ID) module to bridge the gap
between relative positions among joints. The ID module disentangles the
conventional 3D joint representation into a 4D bone representation, comprising
the 3D spatial direction vector and 1D spatial distance vector between adjacent
joints. Additionally, an Attribute Controllable Diffusion (ACD) module is
introduced to further constrain joint associations, in which the attribute
separation layer aims to separate the bone direction and length attributes, and
the attribute control layer is designed to guide the pose generation by
leveraging the above attributes. The ACD module utilizes the gloss embeddings
as semantic conditions and finally generates sign poses from noise embeddings.
Extensive experiments on PHOENIX14T and USTC-CSL datasets validate the
effectiveness of our method. The code is available at:
https://github.com/NaVi-start/Sign-IDD.
comment: 9 pages, 5 figures
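The 4D bone representation described above can be illustrated with a few lines of tensor code: each bone between adjacent joints is split into a 3D unit direction and a 1D length. The skeleton topology used here is a toy chain, not the paper's joint ordering.

```python
import torch

def joints_to_bones(joints, parents):
    """Hedged sketch of the 4D bone representation: per-bone 3D direction plus
    1D length, computed between each joint and its parent."""
    # joints: (T, J, 3) sign-pose sequence; parents[j] is the parent joint of j
    parent_pos = joints[:, parents]                       # (T, J, 3)
    bone_vec = joints - parent_pos                        # relative offsets
    length = bone_vec.norm(dim=-1, keepdim=True)          # (T, J, 1)
    direction = bone_vec / length.clamp(min=1e-6)         # (T, J, 3) unit vectors
    return torch.cat([direction, length], dim=-1)         # (T, J, 4) bone representation

poses = torch.randn(10, 21, 3)        # toy: 10 frames, 21 joints
parents = torch.arange(21) - 1        # simple kinematic chain for illustration
parents[0] = 0                        # root is its own parent (zero-length bone)
bones = joints_to_bones(poses, parents)   # (10, 21, 4)
```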
☆ Hybrid CNN-LSTM based Indoor Pedestrian Localization with CSI Fingerprint Maps
The paper presents a novel Wi-Fi fingerprinting system that uses Channel
State Information (CSI) data for fine-grained pedestrian localization. The
proposed system exploits the frequency diversity and spatial diversity of the
features extracted from CSI data to generate a 2D+channel image termed a CSI
Fingerprint Map. We then use this CSI Fingerprint Map representation of CSI
data to generate a pedestrian trajectory hypothesis using a hybrid architecture
that combines a Convolutional Neural Network and a Long Short-Term Memory
Recurrent Neural Network model. The proposed architecture exploits the temporal
and spatial relationship information among the CSI data observations gathered
at neighboring locations. A particle filter is then employed to separate out
the most likely hypothesis matching a human walk model. The experimental
performance of our method is compared to existing deep learning localization
methods such as ConFi and DeepFi, and to a self-developed temporal-feature-based
LSTM location classifier. The experimental results show marked improvement
with an average RMSE of 0.36 m in a moderately dynamic and 0.17 m in a static
environment. Our method is essentially a proof of concept that, given (1) sparse
availability of observations, (2) limited infrastructure requirements, and (3) a
moderate level of short-term and long-term noise in the training and testing
environments, reliable fine-grained Wi-Fi-based pedestrian localization is a
viable option.
comment: 12 pages, 14 figures and 3 tables
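A hedged sketch of the hybrid CNN-LSTM idea follows: a CNN encodes each CSI Fingerprint Map, an LSTM models the temporal relation across neighboring observations, and a linear head regresses a 2D position; per the abstract, the output trajectory hypothesis would then be filtered by a particle filter. Layer sizes and the number of CSI channels are assumptions.

```python
import torch
import torch.nn as nn

class CSIFingerprintLocalizer(nn.Module):
    """Hypothetical CNN-LSTM localizer over sequences of CSI Fingerprint Maps."""
    def __init__(self, csi_channels=3, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(csi_channels, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # (B*T, 64)
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                  # (x, y) position hypothesis

    def forward(self, maps):
        # maps: (B, T, C, H, W) sequence of CSI Fingerprint Maps
        B, T = maps.shape[:2]
        feats = self.cnn(maps.flatten(0, 1)).view(B, T, -1)
        seq, _ = self.lstm(feats)
        return self.head(seq)                             # (B, T, 2) trajectory hypothesis

traj = CSIFingerprintLocalizer()(torch.randn(2, 5, 3, 30, 56))
```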
☆ Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report Generation
Jinghan Sun, Dong Wei, Zhe Xu, Donghuan Lu, Hong Liu, Hong Wang, Sotirios A. Tsaftaris, Steven McDonagh, Yefeng Zheng, Liansheng Wang
Anatomical abnormality detection and report generation of chest X-ray (CXR)
are two essential tasks in clinical practice. The former aims at localizing and
characterizing cardiopulmonary radiological findings in CXRs, while the latter
summarizes the findings in a detailed report for further diagnosis and
treatment. Existing methods have often focused on either task separately, ignoring
their correlation. This work proposes a co-evolutionary abnormality detection
and report generation (CoE-DG) framework. The framework utilizes both fully
labeled (with bounding box annotations and clinical reports) and weakly labeled
(with reports only) data to achieve mutual promotion between the abnormality
detection and report generation tasks. Specifically, we introduce a
bi-directional information interaction strategy with generator-guided
information propagation (GIP) and detector-guided information propagation
(DIP). For semi-supervised abnormality detection, GIP takes the informative
feature extracted by the generator as an auxiliary input to the detector and
uses the generator's prediction to refine the detector's pseudo labels. We
further propose an intra-image-modal self-adaptive non-maximum suppression
module (SA-NMS). This module dynamically rectifies pseudo detection labels
generated by the teacher detection model with high-confidence predictions by
the student. Conversely, for report generation, DIP takes the abnormalities'
categories and locations predicted by the detector as input and guidance for
the generator to improve the generated reports.
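The pseudo-label rectification step (SA-NMS) can be pictured with a short, hedged sketch: teacher pseudo detections are merged with high-confidence student predictions and standard NMS arbitrates between the two sets. The thresholds are illustrative assumptions, not the paper's values.

```python
import torch
from torchvision.ops import nms

def self_adaptive_nms(teacher_boxes, teacher_scores,
                      student_boxes, student_scores,
                      student_conf_thresh=0.8, iou_thresh=0.5):
    """Hedged sketch of the SA-NMS idea: keep high-confidence student predictions
    and let NMS rectify the teacher's pseudo detection labels."""
    keep_student = student_scores >= student_conf_thresh
    boxes = torch.cat([teacher_boxes, student_boxes[keep_student]])
    scores = torch.cat([teacher_scores, student_scores[keep_student]])
    keep = nms(boxes, scores, iou_thresh)                 # standard hard NMS
    return boxes[keep], scores[keep]                      # refined pseudo labels

t_boxes = torch.tensor([[10., 10., 60., 60.], [100., 40., 160., 120.]])
t_scores = torch.tensor([0.55, 0.60])
s_boxes = torch.tensor([[12., 11., 62., 63.]])
s_scores = torch.tensor([0.92])
boxes, scores = self_adaptive_nms(t_boxes, t_scores, s_boxes, s_scores)
```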
☆ Generalizable Sensor-Based Activity Recognition via Categorical Concept Invariant Learning
Human Activity Recognition (HAR) aims to recognize activities by training
models on massive sensor data. In real-world deployment, a crucial aspect of
HAR that has been largely overlooked is that the test sets may have different
distributions from training sets due to inter-subject variability including
age, gender, behavioral habits, etc., which leads to poor generalization
performance. One promising solution is to learn domain-invariant
representations to enable a model to generalize on an unseen distribution.
However, most existing methods only consider the feature-invariance of the
penultimate layer for domain-invariant learning, which leads to suboptimal
results. In this paper, we propose a Categorical Concept Invariant Learning
(CCIL) framework for generalizable activity recognition, which introduces a
concept matrix to regularize the model in the training stage by simultaneously
concentrating on feature-invariance and logit-invariance. Our key idea is that
the concept matrix for samples belonging to the same activity category should
be similar. Extensive experiments on four public HAR benchmarks demonstrate
that our CCIL substantially outperforms the state-of-the-art approaches under
cross-person, cross-dataset, cross-position, and one-person-to-another
settings.
comment: Accepted by AAAI 2025
☆ Bridge then Begin Anew: Generating Target-relevant Intermediate Model for Source-free Visual Emotion Adaptation
Jiankun Zhu, Sicheng Zhao, Jing Jiang, Wenbo Tang, Zhaopan Xu, Tingting Han, Pengfei Xu, Hongxun Yao
Visual emotion recognition (VER), which aims at understanding humans'
emotional reactions toward different visual stimuli, has attracted increasing
attention. Given the subjective and ambiguous characteristics of emotion,
annotating a reliable large-scale dataset is difficult. To reduce reliance on
data labeling, domain adaptation offers an alternative solution by adapting
models trained on labeled source data to unlabeled target data. Conventional
domain adaptation methods require access to source data. However, due to
privacy concerns, source emotional data may be inaccessible. To address this
issue, we propose an unexplored task: source-free domain adaptation (SFDA) for
VER, which does not have access to source data during the adaptation process.
To achieve this, we propose a novel framework termed Bridge then Begin Anew
(BBA), which consists of two steps: domain-bridged model generation (DMG) and
target-related model adaptation (TMA). First, the DMG bridges cross-domain gaps
by generating an intermediate model, avoiding direct alignment between two VER
datasets with significant differences. Then, the TMA begins training the target
model anew to fit the target structure, avoiding the influence of
source-specific knowledge. Extensive experiments are conducted on six SFDA
settings for VER. The results demonstrate the effectiveness of BBA, which
achieves remarkable performance gains compared with state-of-the-art SFDA
methods and outperforms representative unsupervised domain adaptation
approaches.
comment: Accepted by AAAI2025
☆ Seeking Consistent Flat Minima for Better Domain Generalization via Refining Loss Landscapes
Domain generalization aims to learn a model from multiple training domains
and generalize it to unseen test domains. Recent theory has shown that seeking
the deep models, whose parameters lie in the flat minima of the loss landscape,
can significantly reduce the out-of-domain generalization error. However,
existing methods often neglect the consistency of loss landscapes in different
domains, resulting in models that are not simultaneously in the optimal flat
minima in all domains, which limits their generalization ability. To address
this issue, this paper proposes an iterative Self-Feedback Training (SFT)
framework to seek consistent flat minima that are shared across different
domains by progressively refining loss landscapes during training. It
alternately generates a feedback signal by measuring the inconsistency of
loss landscapes in different domains and refines these loss landscapes for
greater consistency using this feedback signal. Benefiting from the consistency
of the flat minima within these refined loss landscapes, our SFT helps achieve
better out-of-domain generalization. Extensive experiments on DomainBed
demonstrate the superior performance of SFT compared to state-of-the-art
sharpness-aware methods and other prevalent DG baselines. On average across
five DG benchmarks, SFT surpasses sharpness-aware minimization by 2.6% with
ResNet-50 and 1.5% with ViT-B/16, respectively. The code will be available
soon.
☆ Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset
We address an advanced challenge of predicting pedestrian occupancy as an
extension of multi-view pedestrian detection in urban traffic. To support this,
we have created a new synthetic dataset called MVP-Occ, designed for dense
pedestrian scenarios in large-scale scenes. Our dataset provides detailed
representations of pedestrians using voxel structures, accompanied by rich
semantic scene understanding labels, facilitating visual navigation and
insights into pedestrian spatial information. Furthermore, we present a robust
baseline model, termed OmniOcc, capable of predicting both the voxel occupancy
state and panoptic labels for the entire scene from multi-view images. Through
in-depth analysis, we identify and evaluate the key elements of our proposed
model, highlighting their specific contributions and importance.
comment: AAAI 2025
☆ CA-Edit: Causality-Aware Condition Adapter for High-Fidelity Local Facial Attribute Editing
Xiaole Xian, Xilin He, Zenghao Niu, Junliang Zhang, Weicheng Xie, Siyang Song, Zitong Yu, Linlin Shen
For efficient and high-fidelity local facial attribute editing, most existing
editing methods either require additional fine-tuning for different editing
effects or tend to affect beyond the editing regions. Alternatively, inpainting
methods can edit the target image region while preserving external areas.
However, current inpainting methods still suffer from misalignment between the
generated content and the facial attribute description, as well as the loss of
facial skin details. To address these challenges, (i) a novel data utilization strategy is
introduced to construct datasets consisting of attribute-text-image triples
from a data-driven perspective, (ii) a Causality-Aware Condition Adapter is
proposed to enhance the contextual causality modeling of specific details,
which encodes the skin details from the original image while preventing
conflicts between these cues and textual conditions. In addition, a Skin
Transition Frequency Guidance technique is introduced for the local modeling of
contextual causality via sampling guidance driven by low-frequency alignment.
Extensive quantitative and qualitative experiments demonstrate the
effectiveness of our method in boosting both fidelity and editability for
localized attribute editing. The code is available at
https://github.com/connorxian/CA-Edit.
comment: Accepted by AAAI
☆ Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation
Changsun Lee, Sangjoon Park, Cheong-Il Shin, Woo Hee Choi, Hyun Jeong Park, Jeong Eun Lee, Jong Chul Ye
Recent medical vision-language models (VLMs) have shown promise in 2D medical
image interpretation. However, extending them to 3D medical imaging has been
challenging due to computational complexities and data scarcity. Although a few
recent VLMs designed for 3D medical imaging have emerged, all are limited to
learning a volumetric representation of a 3D medical image as a set of
sub-volumetric features. Such a process introduces overly correlated
representations along the z-axis that neglect slice-specific clinical details,
particularly for 3D medical images where adjacent slices have low redundancy.
To address this limitation, we introduce MS-VLM, which mimics radiologists'
workflow in 3D medical image interpretation. Specifically, radiologists analyze
3D medical images by examining individual slices sequentially and synthesizing
information across slices and views. Likewise, MS-VLM leverages self-supervised
2D transformer encoders to learn a volumetric representation that captures
inter-slice dependencies from a sequence of slice-specific features. Unbound by
sub-volumetric patchification, MS-VLM is capable of obtaining useful volumetric
representations from 3D medical images with any slice length and from multiple
images acquired from different planes and phases. We evaluate MS-VLM on
publicly available chest CT dataset CT-RATE and in-house rectal MRI dataset. In
both scenarios, MS-VLM surpasses existing methods in radiology report
generation, producing more coherent and clinically relevant reports. These
findings highlight the potential of MS-VLM to advance 3D medical image
interpretation and improve the robustness of medical VLMs.
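A hedged sketch of the slice-then-aggregate idea follows: a shared 2D encoder processes each slice, and a transformer over the resulting slice features captures inter-slice dependencies for an arbitrary slice count. The patch-embedding encoder here is a tiny placeholder, not the paper's pretrained 2D ViT.

```python
import torch
import torch.nn as nn

class SliceSequenceEncoder(nn.Module):
    """Hypothetical radiologist-style volumetric encoder: per-slice 2D encoding
    followed by inter-slice aggregation with a transformer."""
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        self.slice_encoder = nn.Sequential(                # placeholder 2D encoder
            nn.Conv2d(1, dim, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inter_slice = nn.TransformerEncoder(layer, layers)

    def forward(self, volume):
        # volume: (B, S, H, W) CT/MRI with an arbitrary number of slices S
        B, S, H, W = volume.shape
        feats = self.slice_encoder(volume.reshape(B * S, 1, H, W)).view(B, S, -1)
        return self.inter_slice(feats)                     # (B, S, dim) volumetric tokens

tokens = SliceSequenceEncoder()(torch.randn(1, 40, 224, 224))
```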
☆ DragScene: Interactive 3D Scene Editing with Single-view Drag Instructions
3D editing has shown remarkable capability in editing scenes based on various
instructions. However, existing methods struggle with achieving intuitive,
localized editing, such as selectively making flowers blossom. Drag-style
editing has shown exceptional capability to edit images with direct
manipulation instead of ambiguous text commands. Nevertheless, extending
drag-based editing to 3D scenes presents substantial challenges due to
multi-view inconsistency. To this end, we introduce DragScene, a framework that
integrates drag-style editing with diverse 3D representations. First, latent
optimization is performed on a reference view to generate 2D edits based on
user instructions. Subsequently, coarse 3D clues are reconstructed from the
reference view using a point-based representation to capture the geometric
details of the edits. The latent representation of the edited view is then
mapped to these 3D clues, guiding the latent optimization of other views. This
process ensures that edits are propagated seamlessly across multiple views,
maintaining multi-view consistency. Finally, the target 3D scene is
reconstructed from the edited multi-view images. Extensive experiments
demonstrate that DragScene facilitates precise and flexible drag-style editing
of 3D scenes, supporting broad applicability across diverse 3D representations.
☆ Turbo-GS: Accelerating 3D Gaussian Fitting for High-Quality Radiance Fields
Tao Lu, Ankit Dhiman, R Srinath, Emre Arslan, Angela Xing, Yuanbo Xiangli, R Venkatesh Babu, Srinath Sridhar
Novel-view synthesis is an important problem in computer vision with
applications in 3D reconstruction, mixed reality, and robotics. Recent methods
like 3D Gaussian Splatting (3DGS) have become the preferred method for this
task, providing high-quality novel views in real time. However, the training
time of a 3DGS model is slow, often taking 30 minutes for a scene with 200
views. In contrast, our goal is to reduce the optimization time by training for
fewer steps while maintaining high rendering quality. Specifically, we combine
the guidance from both the position error and the appearance error to achieve a
more effective densification. To balance the rate between adding new Gaussians
and fitting old Gaussians, we develop a convergence-aware budget control
mechanism. Moreover, to make the densification process more reliable, we
selectively add new Gaussians from mostly visited regions. With these designs,
we reduce the Gaussian optimization steps to one-third of the previous approach
while achieving a comparable or even better novel view rendering quality. To
further facilitate the rapid fitting of 4K resolution images, we introduce a
dilation-based rendering technique. Our method, Turbo-GS, speeds up
optimization for typical scenes and scales well to high-resolution (4K)
scenarios on standard datasets. Through extensive experiments, we show that our
method is significantly faster in optimization than other methods while
retaining quality. Project page: https://ivl.cs.brown.edu/research/turbo-gs.
comment: Project page: https://ivl.cs.brown.edu/research/turbo-gs
☆ Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Video has emerged as a favored multimedia format on the internet. To better
access video content, a new topic, HIREST, has been presented, comprising video
retrieval, moment retrieval, moment segmentation, and step-captioning. The
pioneering work chooses a pre-trained CLIP-based model for video retrieval
and leverages it as a feature extractor for the other three challenging tasks,
which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to
learn the comprehensive cognition of user-preferred content, due to
disregarding the hierarchies and association relations across modalities. In
this paper, guided by the shallow-to-deep principle, we propose a query-centric
audio-visual cognition (QUAG) network to construct a reliable multi-modal
representation for moment retrieval, segmentation and step-captioning.
Specifically, we first design the modality-synergistic perception to obtain
rich audio-visual content, by modeling global contrastive alignment and local
fine-grained interaction between visual and audio modalities. Then, we devise
the query-centric cognition that uses the deep-level query to perform the
temporal-channel filtration on the shallow-level audio-visual representation.
This can cognize user-preferred content and thus attain a query-centric
audio-visual representation for three tasks. Extensive experiments show QUAG
achieves the SOTA results on HIREST. Further, we test QUAG on the query-based
video summarization task and verify its good generalization.
comment: Accepted by AAAI 2025
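As a hedged, hypothetical illustration of the temporal-channel filtration step, the sketch below lets a pooled deep-level query produce gates over the channel and temporal dimensions of the shallow audio-visual representation, suppressing query-irrelevant content. Layer names and sizes are assumptions, not QUAG's actual design.

```python
import torch
import torch.nn as nn

class TemporalChannelFiltration(nn.Module):
    """Hypothetical query-centric filtration over audio-visual features."""
    def __init__(self, dim=512):
        super().__init__()
        self.channel_gate = nn.Linear(dim, dim)
        self.temporal_gate = nn.Linear(dim, 1)

    def forward(self, av_feats, query_tokens):
        # av_feats: (B, T, C) audio-visual features; query_tokens: (B, L, C)
        q = query_tokens.mean(dim=1)                                 # (B, C) pooled query
        c_gate = torch.sigmoid(self.channel_gate(q)).unsqueeze(1)    # (B, 1, C) channel filtration
        sim = (av_feats * q.unsqueeze(1)).sum(-1, keepdim=True)      # (B, T, 1) query relevance
        t_gate = torch.sigmoid(self.temporal_gate(av_feats) + sim)   # (B, T, 1) temporal filtration
        return av_feats * c_gate * t_gate                            # query-centric representation

out = TemporalChannelFiltration()(torch.randn(2, 30, 512), torch.randn(2, 12, 512))
```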
☆ Spatio-Temporal Fuzzy-oriented Multi-Modal Meta-Learning for Fine-grained Emotion Recognition
Fine-grained emotion recognition (FER) plays a vital role in various fields,
such as disease diagnosis, personalized recommendations, and multimedia mining.
However, existing FER methods face three key challenges in real-world
applications: (i) they rely on large amounts of continuously annotated data to
ensure accuracy since emotions are complex and ambiguous in reality, which is
costly and time-consuming; (ii) they cannot capture the temporal heterogeneity
caused by changing emotion patterns, because they usually assume that the
temporal correlation within sampling periods is the same; (iii) they do not
consider the spatial heterogeneity of different FER scenarios, that is, the
distribution of emotion information in different data may have bias or
interference. To address these challenges, we propose a Spatio-Temporal
Fuzzy-oriented Multi-modal Meta-learning framework (ST-F2M). Specifically,
ST-F2M first divides the multi-modal videos into multiple views, and each view
corresponds to one modality of one emotion. Multiple randomly selected views
for the same emotion form a meta-training task. Next, ST-F2M uses an integrated
module with spatial and temporal convolutions to encode the data of each task,
reflecting the spatial and temporal heterogeneity. Then it adds fuzzy semantic
information to each task based on generalized fuzzy rules, which helps handle
the complexity and ambiguity of emotions. Finally, ST-F2M learns
emotion-related general meta-knowledge through meta-recurrent neural networks
to achieve fast and robust fine-grained emotion recognition. Extensive
experiments show that ST-F2M outperforms various state-of-the-art methods in
terms of accuracy and model efficiency. In addition, we construct ablation
studies and further analysis to explore why ST-F2M performs well.
comment: 13 pages, Submitted to TMM in 30-May-2024
☆ Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance
across diverse tasks. Despite great success, recent studies show that LVLMs
encounter substantial limitations when engaging with visual graphs. To study
the reason behind these limitations, we propose VGCure, a comprehensive
benchmark covering 22 tasks for examining the fundamental graph understanding
and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs
reveal that LVLMs are weak in basic graph understanding and reasoning tasks,
particularly those concerning relational or structurally complex information.
Based on this observation, we propose a structure-aware fine-tuning framework
to enhance LVLMs with structure learning abilities through 3 self-supervised
learning tasks. Experiments validate the effectiveness of our method in
improving LVLMs' zero-shot performance on fundamental graph learning tasks, as
well as enhancing the robustness of LVLMs against complex visual graphs.
☆ Language-guided Medical Image Segmentation with Target-informed Multi-level Contrastive Alignments
Medical image segmentation is crucial in modern medical image analysis, as it
can aid in the diagnosis of various disease conditions. Recently, language-guided
segmentation methods have shown promising results in automating image
segmentation where text reports are incorporated as guidance. These text
reports, containing image impressions and insights given by clinicians,
provide auxiliary guidance. However, these methods neglect the inherent
pattern gaps between the two distinct modalities, which leads to sub-optimal
image-text feature fusion without proper cross-modality feature alignments.
Contrastive alignments are widely used to associate image-text semantics in
representation learning; however, they have not been exploited to bridge the
pattern gaps in language-guided segmentation that relies on subtle low level
image details to represent diseases. Existing contrastive alignment methods
typically align high-level global image semantics without involving low-level,
localized target information, and therefore fail to explore fine-grained text
guidance for language-guided segmentation. In this study, we propose a
language-guided segmentation network with Target-informed Multi-level
Contrastive Alignments (TMCA). TMCA enables target-informed cross-modality
alignments and fine-grained text guidance to bridge the pattern gaps in
language-guided segmentation. Specifically, we introduce: 1) a target-sensitive
semantic distance module that enables granular image-text alignment modelling,
and 2) a multi-level alignment strategy that directs text guidance on low-level
image features. In addition, a language-guided target enhancement module is
proposed to leverage the aligned text to redirect attention to focus on
critical localized image features. Extensive experiments on 4 image-text
datasets, involving 3 medical imaging modalities, demonstrated that our TMCA
achieved superior performances.
☆ Hybrid Data-Free Knowledge Distillation
Data-free knowledge distillation aims to learn a compact student network from
a pre-trained large teacher network without using the original training data of
the teacher network. Existing collection-based and generation-based methods
train student networks by collecting massive real examples and generating
synthetic examples, respectively. However, they inevitably become weak in
practical scenarios due to the difficulties in gathering or emulating
sufficient real-world data. To solve this problem, we propose a novel method
called Hybrid Data-Free Distillation (HiDFD), which leverages only a small amount of collected
data as well as generates sufficient examples for training student networks.
Our HiDFD comprises two primary modules, i.e., the teacher-guided
generation and student distillation. The teacher-guided generation module
guides a Generative Adversarial Network (GAN) by the teacher network to produce
high-quality synthetic examples from very few real-world collected examples.
Specifically, we design a feature integration mechanism to prevent the GAN from
overfitting and facilitate the reliable representation learning from the
teacher network. Meanwhile, we drive a category frequency smoothing technique
via the teacher network to balance the generative training of each category. In
the student distillation module, we explore a data inflation strategy to
properly utilize a blend of real and synthetic data to train the student
network via a classifier-sharing-based feature alignment technique. Intensive
experiments across multiple benchmarks demonstrate that our HiDFD can achieve
state-of-the-art performance using 120 times less collected data than existing
methods. Code is available at https://github.com/tangjialiang97/HiDFD.
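To show the flavor of the student-distillation step, here is a hedged sketch in which a blend of the few collected real images and GAN-generated images is distilled from the frozen teacher with a soft KL objective; the feature-alignment and category-frequency components of HiDFD are omitted, and the hyper-parameter `tau` is an assumption.

```python
import torch
import torch.nn.functional as F

def hybrid_distillation_step(student, teacher, real_x, synth_x, tau=4.0):
    """Hedged sketch: distill the student on an inflated batch of real plus
    synthetic examples using temperature-scaled KL divergence."""
    x = torch.cat([real_x, synth_x])                       # data inflation: real + synthetic
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                  F.softmax(t_logits / tau, dim=1),
                  reduction="batchmean") * tau * tau
    return kd                                              # minimized w.r.t. student parameters

# toy usage with linear models standing in for the networks
student = torch.nn.Linear(32, 10)
teacher = torch.nn.Linear(32, 10).eval()
loss = hybrid_distillation_step(student, teacher, torch.randn(4, 32), torch.randn(12, 32))
loss.backward()
```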
♻ ☆ SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models
Given an input video of a person and a new garment, the objective of this
paper is to synthesize a new video where the person is wearing the specified
garment while maintaining spatiotemporal consistency. Although significant
advances have been made in image-based virtual try-on, extending these
successes to video often leads to frame-to-frame inconsistencies. Some
approaches have attempted to address this by increasing the overlap of frames
across multiple video chunks, but this comes at a steep computational cost due
to the repeated processing of the same frames, especially for long video
sequences. To tackle these challenges, we reconceptualize video virtual try-on
as a conditional video inpainting task, with garments serving as input
conditions. Specifically, our approach enhances image diffusion models by
incorporating temporal attention layers to improve temporal coherence. To
reduce computational overhead, we propose ShiftCaching, a novel technique that
maintains temporal consistency while minimizing redundant computations.
Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset
featuring more complex backgrounds, challenging movements, and higher
resolution compared to existing public datasets. Extensive experiments
demonstrate that our approach outperforms current baselines, particularly in
terms of video consistency and inference speed. The project page is available
at https://swift-try.github.io/.
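A hedged illustration of the ShiftCaching intuition (not the authors' exact algorithm): at each denoising step the video is split into non-overlapping chunks whose boundaries are shifted by a step-dependent offset, so stitching points vary across steps without reprocessing overlapping frames within a step. The shift schedule below is an assumption.

```python
def shifted_chunks(num_frames, chunk_size, step):
    """Hypothetical chunk schedule: non-overlapping chunks whose boundaries
    shift with the denoising step index."""
    offset = (step * chunk_size // 2) % chunk_size          # assumed shift schedule
    edges = list(range(offset, num_frames, chunk_size))
    if not edges or edges[0] != 0:
        edges = [0] + edges
    edges.append(num_frames)
    return [list(range(a, b)) for a, b in zip(edges, edges[1:]) if a < b]

for step in range(3):
    print(step, shifted_chunks(num_frames=10, chunk_size=4, step=step))
# step 0: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# step 1: [[0, 1], [2, 3, 4, 5], [6, 7, 8, 9]]
# step 2: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```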
♻ ☆ EvalGIM: A Library for Evaluating Generative Image Models
Melissa Hall, Oscar Mañas, Reyhane Askari-Hemmat, Mark Ibrahim, Candace Ross, Pietro Astolfi, Tariq Berrada Ifriqi, Marton Havasi, Yohann Benchetrit, Karen Ullrich, Carolina Braga, Abhishek Charnalia, Maeve Ryan, Mike Rabbat, Michal Drozdzal, Jakob Verbeek, Adriana Romero-Soriano
As the use of text-to-image generative models increases, so does the adoption
of automatic benchmarking methods used in their evaluation. However, while
metrics and datasets abound, there are few unified benchmarking libraries that
provide a framework for performing evaluations across many datasets and
metrics. Furthermore, the rapid introduction of increasingly robust
benchmarking methods requires that evaluation libraries remain flexible to new
datasets and metrics. Finally, there remains a gap in synthesizing evaluations
in order to deliver actionable takeaways about model performance. To enable
unified, flexible, and actionable evaluations, we introduce EvalGIM (pronounced
''EvalGym''), a library for evaluating generative image models. EvalGIM
contains broad support for datasets and metrics used to measure quality,
diversity, and consistency of text-to-image generative models. In addition,
EvalGIM is designed with flexibility for user customization as a top priority
and contains a structure that allows plug-and-play additions of new datasets
and metrics. To enable actionable evaluation insights, we introduce
''Evaluation Exercises'' that highlight takeaways for specific evaluation
questions. The Evaluation Exercises contain easy-to-use and reproducible
implementations of two state-of-the-art evaluation methods of text-to-image
generative models: consistency-diversity-realism Pareto Fronts and
disaggregated measurements of performance disparities across groups. EvalGIM
also contains Evaluation Exercises that introduce two new analysis methods for
text-to-image generative models: robustness analyses of model rankings and
balanced evaluations across different prompt styles. We encourage text-to-image
model exploration with EvalGIM and invite contributions at
https://github.com/facebookresearch/EvalGIM/.
comment: For code, see https://github.com/facebookresearch/EvalGIM/tree/main
♻ ☆ Restore Anything Model via Efficient Degradation Adaptation
Bin Ren, Eduard Zamfir, Zongwei Wu, Yawei Li, Yidi Li, Danda Pani Paudel, Radu Timofte, Ming-Hsuan Yang, Nicu Sebe
With the proliferation of mobile devices, the need for an efficient model to
restore any degraded image has become increasingly significant and impactful.
Traditional approaches typically involve training dedicated models for each
specific degradation, resulting in inefficiency and redundancy. More recent
solutions either introduce additional modules to learn visual prompts,
significantly increasing model size, or incorporate cross-modal transfer from
large language models trained on vast datasets, adding complexity to the system
architecture. In contrast, our approach, termed RAM, takes a unified path that
leverages inherent similarities across various degradations to enable both
efficient and comprehensive restoration through a joint embedding mechanism
without scaling up the model or relying on large multimodal models.
Specifically, we examine the sub-latent space of each input, identifying key
components and reweighting them in a gated manner. This intrinsic degradation
awareness is further combined with contextualized attention in an X-shaped
framework, enhancing local-global interactions. Extensive benchmarking in an
all-in-one restoration setting confirms RAM's SOTA performance, reducing model
complexity by approximately 82% in trainable parameters and 85% in FLOPs. Our
code and models will be publicly available.
comment: Efficient Any Image Restoration
♻ ☆ Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge NeurIPS 2024
Contrastive Language-Image Pretraining (CLIP) performs zero-shot image
classification by mapping images and textual class representation into a shared
embedding space, then retrieving the class closest to the image. This work
provides a new approach for interpreting CLIP models for image classification
from the lens of mutual knowledge between the two modalities. Specifically, we
ask: what concepts do both vision and language CLIP encoders learn in common
that influence the joint embedding space, causing points to be closer or
further apart? We answer this question via an approach of textual concept-based
explanations, showing their effectiveness, and perform an analysis encompassing
a pool of 13 CLIP models varying in architecture, size and pretraining
datasets. We explore those different aspects in relation to mutual knowledge,
and analyze zero-shot predictions. Our approach demonstrates an effective and
human-friendly way of understanding zero-shot classification decisions with
CLIP.
comment: Accepted to NeurIPS 2024
♻ ☆ CNNtention: Can CNNs do better with Attention?
Convolutional Neural Networks (CNNs) have been the standard for image
classification tasks for a long time, but more recently attention-based
mechanisms have gained traction. This project aims to compare traditional CNNs
with attention-augmented CNNs across an image classification task. By
evaluating and comparing their performance, accuracy and computational
efficiency, the project will highlight the benefits and trade-offs of the localized
feature extraction of traditional CNNs and the global context capture in
attention-augmented CNNs. By doing this, we can reveal further insights into
their respective strengths and weaknesses, guide the selection of models based
on specific application needs and ultimately, enhance understanding of these
architectures in the deep learning community.
This was our final project for CS7643 Deep Learning course at Georgia Tech.
comment: 10 pages, 11 figures
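A minimal sketch of what "attention-augmented CNN" can mean in practice, assuming a simple design in which a convolution captures local features and multi-head self-attention over the flattened spatial positions adds global context; sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionAugmentedBlock(nn.Module):
    """Hypothetical CNN block augmented with spatial self-attention."""
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        x = self.conv(x)                                   # local feature extraction
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (B, H*W, C)
        ctx, _ = self.attn(tokens, tokens, tokens)         # global context capture
        return x + ctx.transpose(1, 2).view(B, C, H, W)    # residual fusion

y = AttentionAugmentedBlock()(torch.randn(1, 64, 16, 16))
```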
♻ ☆ Benchmarking Pretrained Attention-based Models for Real-Time Recognition in Robot-Assisted Esophagectomy
Ronald L. P. D. de Jong, Yasmina al Khalil, Tim J. M. Jaspers, Romy C. van Jaarsveld, Gino M. Kuiper, Yiping Li, Richard van Hillegersberg, Jelle P. Ruurda, Marcel Breeuwer, Fons van der Sommen
Esophageal cancer is among the most common types of cancer worldwide. It is
traditionally treated using open esophagectomy, but in recent years,
robot-assisted minimally invasive esophagectomy (RAMIE) has emerged as a
promising alternative. However, robot-assisted surgery can be challenging for
novice surgeons, as they often suffer from a loss of spatial orientation.
Computer-aided anatomy recognition holds promise for improving surgical
navigation, but research in this area remains limited. In this study, we
developed a comprehensive dataset for semantic segmentation in RAMIE, featuring
the largest collection of vital anatomical structures and surgical instruments
to date. Handling this diverse set of classes presents challenges, including
class imbalance and the recognition of complex structures such as nerves. This
study aims to understand the challenges and limitations of current
state-of-the-art algorithms on this novel dataset and problem. Therefore, we
benchmarked eight real-time deep learning models using two pretraining
datasets. We assessed both traditional and attention-based networks,
hypothesizing that attention-based networks better capture global patterns and
address challenges such as occlusion caused by blood or other tissues. The
benchmark includes our RAMIE dataset and the publicly available CholecSeg8k
dataset, enabling a thorough assessment of surgical segmentation tasks. Our
findings indicate that pretraining on ADE20k, a dataset for semantic
segmentation, is more effective than pretraining on ImageNet. Furthermore,
attention-based models outperform traditional convolutional neural networks,
with SegNeXt and Mask2Former achieving higher Dice scores, and Mask2Former
additionally excelling in average symmetric surface distance.
comment: Accepted for presentation at the SPIE Medical Imaging Conference,
2025
♻ ☆ HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction
How can we predict future interaction trajectories of human hands in a scene
given high-level colloquial task specifications in the form of natural
language? In this paper, we extend the classic hand trajectory prediction task
to two tasks involving explicit or implicit language queries. Our proposed
tasks require extensive understanding of human daily activities and reasoning
abilities about what should be happening next given cues from the current
scene. We also develop new benchmarks to evaluate the proposed two tasks,
Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We
enable solving these tasks by integrating high-level world knowledge and
reasoning capabilities of Vision-Language Models (VLMs) with the
auto-regressive nature of low-level ego-centric hand trajectories. Our model,
HandsOnVLM is a novel VLM that can generate textual responses and produce
future hand trajectories through natural-language conversations. Our
experiments show that HandsOnVLM outperforms existing task-specific methods and
other VLM baselines on proposed tasks, and demonstrates its ability to
effectively utilize world knowledge for reasoning about low-level human hand
trajectories based on the provided context. Our website contains code and
detailed video results https://www.chenbao.tech/handsonvlm/
comment: Preprint. Under Review
♻ ☆ AGFSync: Leveraging AI-Generated Feedback for Preference Optimization in Text-to-Image Generation
Jingkun An, Yinghao Zhu, Zongjian Li, Enshen Zhou, Haoran Feng, Xijie Huang, Bohua Chen, Yemin Shi, Chengwei Pan
Text-to-Image (T2I) diffusion models have achieved remarkable success in
image generation. Despite their progress, challenges remain in
prompt-following ability, image quality, and the lack of high-quality datasets,
which are essential for refining these models. As acquiring labeled data is
costly, we introduce AGFSync, a framework that enhances T2I diffusion models
through Direct Preference Optimization (DPO) in a fully AI-driven approach.
AGFSync utilizes Vision-Language Models (VLM) to assess image quality across
style, coherence, and aesthetics, generating feedback data within an AI-driven
loop. By applying AGFSync to leading T2I models such as SD v1.4, v1.5, and
SDXL-base, our extensive experiments on the TIFA dataset demonstrate notable
improvements in VQA scores, aesthetic evaluations, and performance on the HPSv2
benchmark, consistently outperforming the base models. AGFSync's method of
refining T2I diffusion models paves the way for scalable alignment techniques.
Our code and dataset are publicly available at
https://anjingkun.github.io/AGFSync.
comment: Accepted by AAAI-2025
♻ ☆ Image Synthesis under Limited Data: A Survey and Taxonomy
Deep generative models, which target reproducing the given data distribution
to produce novel samples, have made unprecedented advancements in recent years.
Their technical breakthroughs have enabled unparalleled quality in the
synthesis of visual content. However, one critical prerequisite for their
tremendous success is the availability of a sufficient number of training
samples, which requires massive computation resources. When trained on limited
data, generative models tend to suffer from severe performance deterioration
due to overfitting and memorization. Accordingly, researchers have devoted
considerable attention to developing novel models capable of generating
plausible and diverse images from limited training data. Despite
numerous efforts to enhance training stability and synthesis quality in the
limited data scenarios, there is a lack of a systematic survey that provides 1)
a clear problem definition, critical challenges, and taxonomy of various tasks;
2) an in-depth analysis of the pros, cons, and remaining limitations of existing
literature; as well as 3) a thorough discussion on the potential applications
and future directions in the field of image synthesis under limited data. In
order to fill this gap and provide an informative introduction to researchers
who are new to this topic, this survey offers a comprehensive review and a
novel taxonomy on the development of image synthesis under limited data. In
particular, it covers the problem definition, requirements, main solutions,
popular benchmarks, and remaining challenges in a comprehensive and all-around
manner.
comment: 230 references, 25 pages. GitHub:
https://github.com/kobeshegu/awesome-few-shot-generation
♻ ☆ Towards Deployable OCR models for Indic languages
Recognition of text on word or line images, without the need for sub-word
segmentation has become the mainstream of research and development of text
recognition for Indian languages. Modelling unsegmented sequences using
Connectionist Temporal Classification (CTC) is the most commonly used approach
for segmentation-free OCR. In this work we present a comprehensive empirical
study of various neural network models that use CTC for transcribing step-wise
predictions in the neural network output to a Unicode sequence. The study is
conducted for 13 Indian languages, using an internal dataset that has around
1000 pages per language. We study the choice of line vs word as the recognition
unit, and use of synthetic data to train the models. We compare our models with
popular publicly available OCR tools for end-to-end document image recognition.
Our end-to-end pipeline, which employs our recognition models and existing text
segmentation tools, outperforms these public OCR tools for 8 out of the 13
languages. We also introduce a new public dataset called Mozhi for word and
line recognition in Indian languages. The dataset contains more than 1.2 million
annotated word images (120 thousand text lines) across 13 Indian languages. Our
code, trained models and the Mozhi dataset will be made available at
http://cvit.iiit.ac.in/research/projects/cvit-projects/
comment: presented at ICPR 2024;
https://link.springer.com/chapter/10.1007/978-3-031-78495-8_11
♻ ☆ Sharing Key Semantics in Transformer Makes Efficient Image Restoration NeurIPS2024
Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc Van Gool, Ming-Hsuan Yang, Nicu Sebe
Image Restoration (IR), a classic low-level vision task, has witnessed
significant advancements through deep models that effectively model global
information. Notably, the emergence of Vision Transformers (ViTs) has further
propelled these advancements. When computing, the self-attention mechanism, a
cornerstone of ViTs, tends to encompass all global cues, even those from
semantically unrelated objects or regions. This inclusivity introduces
computational inefficiencies, particularly noticeable with high input
resolution, as it requires processing irrelevant information, thereby impeding
efficiency. Additionally, for IR, it is commonly noted that small segments of a
degraded image, particularly those closely aligned semantically, provide
particularly relevant information to aid in the restoration process, as they
contribute essential contextual cues crucial for accurate reconstruction. To
address these challenges, we propose boosting IR's performance by sharing the
key semantics via Transformer for IR (i.e., SemanIR) in this paper.
Specifically, SemanIR initially constructs a sparse yet comprehensive
key-semantic dictionary within each transformer stage by establishing essential
semantic connections for every degraded patch. Subsequently, this dictionary is
shared across all subsequent transformer blocks within the same stage. This
strategy optimizes attention calculation within each block by focusing
exclusively on semantically related components stored in the key-semantic
dictionary. As a result, attention calculation achieves linear computational
complexity within each window. Extensive experiments across 6 IR tasks confirm
the proposed SemanIR's state-of-the-art performance, quantitatively and
qualitatively showcasing advancements. The visual results, code, and trained
models are available at https://github.com/Amazingren/SemanIR.
comment: Accepted by NeurIPS2024
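The shared key-semantic dictionary can be pictured with a hedged sketch: a top-k neighbor index is built once per stage and reused by every block in that stage, so each token attends only to its stored neighbors and the per-window cost becomes linear for fixed k. The similarity measure and gathering code here are assumptions, not SemanIR's exact construction.

```python
import torch
import torch.nn.functional as F

def build_key_semantic_dict(tokens, k=8):
    """Hypothetical dictionary: for each degraded patch token, store the indices
    of its k most semantically related tokens (cosine similarity + top-k)."""
    normed = F.normalize(tokens, dim=-1)
    sim = normed @ normed.transpose(1, 2)                  # (B, N, N) patch similarities
    return sim.topk(k, dim=-1).indices                     # (B, N, k) shared across blocks

def key_semantic_attention(q, kv, neighbor_idx):
    """Attention restricted to the stored neighbors."""
    B, N, C = kv.shape
    b_idx = torch.arange(B).view(B, 1, 1)
    keys = kv[b_idx, neighbor_idx]                         # (B, N, k, C) gathered neighbors
    attn = torch.softmax((q.unsqueeze(2) * keys).sum(-1) / C ** 0.5, dim=-1)
    return (attn.unsqueeze(-1) * keys).sum(2)              # (B, N, C)

tokens = torch.randn(1, 64, 32)
idx = build_key_semantic_dict(tokens)                      # computed once per stage
out = key_semantic_attention(tokens, tokens, idx)          # reused by every block in the stage
```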
♻ ☆ Signal Reconstruction from Samples at Unknown Locations with Application to 2D Unknown View Tomography
It is well known that a band-limited signal can be reconstructed from its
uniformly spaced samples if the sampling rate is sufficiently high. More
recently, it has been proved that one can reconstruct a 1D band-limited signal
even if the exact sample locations are unknown, but given a uniform
distribution of the sample locations and their ordering in 1D. In this work, we
extend the analytical error bounds in such scenarios for quasi-bandlimited
(QBL) signals, and for the case of arbitrary but known sampling distributions.
We also prove that such reconstruction methods are resilient to a certain
proportion of errors in the specification of the sample location ordering. We
then express the problem of tomographic reconstruction of 2D images from 1D
Radon projections under unknown angles (2D UVT) with known angle distribution,
as a special case for reconstruction of QBL signals from samples at unknown
locations with known distribution. Building upon our theoretical background, we
present asymptotic bounds for 2D QBL image reconstruction from 1D Radon
projections in the unknown angles setting, and present an extensive set of
simulations to verify these bounds in varied parameter regimes. To the best of
our knowledge, this is the first piece of work to perform such an analysis for
2D UVT and explicitly relate it to advances in sampling theory, even though the
associated reconstruction algorithms have been known for a long time.
comment: This is a preprint of a paper accepted to Signal Processing
(Elsevier)
♻ ☆ Clothes-Changing Person Re-Identification with Feasibility-Aware Intermediary Matching
Current clothes-changing person re-identification (re-id) approaches usually
perform retrieval based on clothes-irrelevant features, while neglecting the
potential of clothes-relevant features. However, we observe that relying solely
on clothes-irrelevant features for clothes-changing re-id is limited, since
they often lack adequate identity information and suffer from large intra-class
variations. On the contrary, clothes-relevant features can be used to discover
same-clothes intermediaries that possess informative identity clues. Based on
this observation, we propose a Feasibility-Aware Intermediary Matching (FAIM)
framework to additionally utilize clothes-relevant features for retrieval.
Firstly, an Intermediary Matching (IM) module is designed to perform an
intermediary-assisted matching process. This process involves using
clothes-relevant features to find informative intermediates, and then using
clothes-irrelevant features of these intermediates to complete the matching.
Secondly, in order to reduce the negative effect of low-quality intermediaries,
an Intermediary-Based Feasibility Weighting (IBFW) module is designed to
evaluate the feasibility of intermediary matching process by assessing the
quality of intermediaries. Extensive experiments demonstrate that our method
outperforms state-of-the-art methods on several widely-used clothes-changing
re-id benchmarks.
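The intermediary-assisted matching process can be sketched in a few lines, under the assumption of a simple two-step retrieval: clothes-relevant features find same-clothes intermediaries in the gallery, whose clothes-irrelevant (identity) features then complete the query-to-gallery matching; the feasibility-weighting step of FAIM is omitted and the fusion weight is illustrative.

```python
import torch
import torch.nn.functional as F

def intermediary_matching(q_cloth, q_id, g_cloth, g_id, top_m=5):
    """Hedged sketch of intermediary-assisted matching for clothes-changing re-id."""
    # step 1: query -> intermediaries via clothes-relevant similarity
    cloth_sim = F.normalize(q_cloth, dim=-1) @ F.normalize(g_cloth, dim=-1).T   # (Q, G)
    inter_idx = cloth_sim.topk(top_m, dim=-1).indices                            # (Q, m)
    # step 2: intermediaries -> gallery via clothes-irrelevant identity features
    id_sim = F.normalize(g_id, dim=-1) @ F.normalize(g_id, dim=-1).T             # (G, G)
    indirect = id_sim[inter_idx].mean(dim=1)                                     # (Q, G)
    # fuse direct identity matching with the intermediary-assisted scores
    direct = F.normalize(q_id, dim=-1) @ F.normalize(g_id, dim=-1).T
    return 0.5 * direct + 0.5 * indirect                                         # final ranking scores

scores = intermediary_matching(torch.randn(4, 128), torch.randn(4, 128),
                               torch.randn(50, 128), torch.randn(50, 128))
```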
♻ ☆ DreamPhysics: Learning Physics-Based 3D Dynamics with Video Diffusion Priors
Dynamic 3D interaction has been attracting a lot of attention recently.
However, creating such 4D content remains challenging. One solution is to
animate 3D scenes with physics-based simulation, which requires manually
assigning precise physical properties to the object; otherwise, the simulated
results become unnatural. Another solution is to learn the deformation of 3D
objects with the distillation of video generative models, which, however, tends
to produce 3D videos with small and discontinuous motions due to the
inappropriate extraction and application of physics priors. In this work, to
combine the strengths and complement the shortcomings of the above two
solutions, we propose to learn the physical properties of a material field with
video diffusion priors, and then utilize a physics-based Material-Point-Method
(MPM) simulator to generate 4D content with realistic motions. In particular,
we propose motion distillation sampling to emphasize video motion information
during distillation. In addition, to facilitate the optimization, we further
propose a KAN-based material field with frame boosting. Experimental results
demonstrate that our method enjoys more realistic motions than
state-of-the-art methods do.
comment: Accepted by AAAI 2025. Codes are released at:
https://github.com/tyhuang0428/DreamPhysics
♻ ☆ CREST: An Efficient Conjointly-trained Spike-driven Framework for Event-based Object Detection Exploiting Spatiotemporal Dynamics
Event-based cameras feature high temporal resolution, wide dynamic range, and
low power consumption, making them ideal for high-speed and low-light object
detection. Spiking neural networks (SNNs) are promising for event-based object
recognition and detection due to their spiking nature but lack efficient
training methods, leading to gradient vanishing and high computational
complexity, especially in deep SNNs. Additionally, existing SNN frameworks
often fail to effectively handle multi-scale spatiotemporal features, leading
to increased data redundancy and reduced accuracy. To address these issues, we
propose CREST, a novel conjointly-trained spike-driven framework to exploit
spatiotemporal dynamics in event-based object detection. We introduce the
conjoint learning rule to accelerate SNN learning and alleviate gradient
vanishing. It also supports dual operation modes for efficient and flexible
implementation on different hardware types. Additionally, CREST features a
fully spike-driven framework with a multi-scale spatiotemporal event integrator
(MESTOR) and a spatiotemporal-IoU (ST-IoU) loss. Our approach achieves superior
object recognition & detection performance and up to 100X energy efficiency
compared with state-of-the-art SNN algorithms on three datasets, providing an
efficient solution for event-based object detection algorithms suitable for SNN
hardware implementation.
comment: Accepted by AAAI 2025
♻ ☆ ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models NeurIPS 2024
Mingrui Wu, Xinyue Cai, Jiayi Ji, Jiale Li, Oucheng Huang, Gen Luo, Hao Fei, Guannan Jiang, Xiaoshuai Sun, Rongrong Ji
In this work, we propose a training-free method to inject visual referring
into Multimodal Large Language Models (MLLMs) through learnable visual token
optimization. We observe the relationship between text prompt tokens and visual
tokens in MLLMs, where attention layers model the connection between them. Our
approach involves adjusting visual tokens from the MLP output during inference,
controlling which text prompt tokens attend to which visual tokens. We optimize
a learnable visual token based on an energy function, enhancing the strength of
referential regions in the attention map. This enables detailed region
description and reasoning without the need for substantial training costs or
model retraining. Our method offers a promising direction for integrating
referential abilities into MLLMs. Our method supports referring with boxes, masks,
scribbles, and points. The results demonstrate that our method exhibits
controllability and interpretability.
comment: Accepted to NeurIPS 2024;
Code:https://github.com/mrwu-mac/ControlMLLM
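A toy, hedged version of the training-free token optimization: a learnable offset on the visual tokens is optimized so that text-to-visual attention mass concentrates on the referred region. The real method operates inside an MLLM's attention layers; here a single dot-product attention stands in, and all shapes and hyper-parameters are assumptions.

```python
import torch

def optimize_visual_tokens(text_q, visual_tokens, region_mask, steps=20, lr=1e-2):
    """Hypothetical energy-based visual prompt learning at inference time."""
    delta = torch.zeros_like(visual_tokens, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        attn = torch.softmax(text_q @ (visual_tokens + delta).T, dim=-1)   # (L, N)
        energy = -attn[:, region_mask].sum()              # encourage attention on the region
        opt.zero_grad()
        energy.backward()
        opt.step()
    return (visual_tokens + delta).detach()

tokens = torch.randn(196, 64)                             # toy visual tokens (e.g., a 14x14 grid)
queries = torch.randn(8, 64)                              # toy text prompt tokens
mask = torch.zeros(196, dtype=torch.bool); mask[50:60] = True
adjusted = optimize_visual_tokens(queries, tokens, mask)
```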
♻ ☆ MoGenTS: Motion Generation based on Spatial-Temporal Joint Modeling NeurIPS 2024
Weihao Yuan, Weichao Shen, Yisheng He, Yuan Dong, Xiaodong Gu, Zilong Dong, Liefeng Bo, Qixing Huang
Motion generation from discrete quantization offers many advantages over
continuous regression, but at the cost of inevitable approximation errors.
Previous methods usually quantize the entire body pose into one code, which not
only faces the difficulty in encoding all joints within one vector but also
loses the spatial relationship between different joints. In contrast, in this
work we quantize each individual joint into one vector, which i) simplifies the
quantization process as the complexity associated with a single joint is
markedly lower than that of the entire pose; ii) maintains a spatial-temporal
structure that preserves both the spatial relationships among joints and the
temporal movement patterns; iii) yields a 2D token map, which enables the
application of various 2D operations widely used in 2D images. Grounded in the
2D motion quantization, we build a spatial-temporal modeling framework, where
2D joint VQVAE, temporal-spatial 2D masking technique, and spatial-temporal 2D
attention are proposed to take advantage of spatial-temporal signals among the
2D tokens. Extensive experiments demonstrate that our method significantly
outperforms previous methods across different datasets, with a 26.6% decrease
of FID on HumanML3D and a 29.9% decrease on KIT-ML. Project page:
https://aigc3d.github.io/mogents.
comment: Accepted to NeurIPS 2024
♻ ☆ Standardizing Generative Face Video Compression using Supplemental Enhancement Information
Bolin Chen, Yan Ye, Jie Chen, Ru-Ling Liao, Shanzhi Yin, Shiqi Wang, Kaifa Yang, Yue Li, Yiling Xu, Ye-Kui Wang, Shiv Gehlot, Guan-Ming Su, Peng Yin, Sean McCarthy, Gary J. Sullivan
This paper proposes a Generative Face Video Compression (GFVC) approach using
Supplemental Enhancement Information (SEI), where a series of compact spatial
and temporal representations of a face video signal (i.e., 2D/3D key-points,
facial semantics and compact features) can be coded using SEI message and
inserted into the coded video bitstream. At the time of writing, the proposed
GFVC approach using SEI messages has been adopted into the official working
draft of Versatile Supplemental Enhancement Information (VSEI) standard by the
Joint Video Experts Team (JVET) of ISO/IEC JTC 1/SC 29 and ITU-T SG16, which
will be standardized as a new version for "ITU-T H.274 | ISO/IEC 23002-7". To
the best of the authors' knowledge, the JVET work on the proposed SEI-based
GFVC approach is the first standardization activity for generative video
compression. The proposed SEI approach has not only advanced the reconstruction
quality of early-day Model-Based Coding (MBC) via the state-of-the-art
generative technique, but also established a new SEI definition for future GFVC
applications and deployment. Experimental results illustrate that the proposed
SEI-based GFVC approach can achieve remarkable rate-distortion performance
compared with the latest Versatile Video Coding (VVC) standard, whilst also
potentially enabling a wide variety of functionalities including user-specified
animation/filtering and metaverse-related applications.
♻ ☆ ArtAug: Enhancing Text-to-Image Generation through Synthesis-Understanding Interaction
The emergence of diffusion models has significantly advanced image synthesis.
Recent studies of model interaction and self-corrective reasoning approaches
in large language models offer new insights for enhancing text-to-image models.
Inspired by these studies, we propose a novel method called ArtAug for
enhancing text-to-image models in this paper. To the best of our knowledge,
ArtAug is the first method that improves image synthesis models via model
interactions with understanding models. In the interactions, we leverage human
preferences implicitly learned by image understanding models to provide
fine-grained suggestions for image synthesis models. The interactions can
modify the image content to make it aesthetically pleasing, such as adjusting
exposure, changing shooting angles, and adding atmospheric effects. The
enhancements brought by the interaction are iteratively fused into the
synthesis model itself through an additional enhancement module. This enables
the synthesis model to directly produce aesthetically pleasing images without
any extra computational cost. In the experiments, we train the ArtAug
enhancement module on existing text-to-image models. Various evaluation metrics
consistently demonstrate that ArtAug enhances the generative capabilities of
text-to-image models without incurring additional computational costs. The
source code and models will be released publicly.
comment: 18 pages, 8 figures
♻ ☆ A Hitchhiker's Guide to Understanding Performances of Two-Class Classifiers
Properly understanding the performances of classifiers is essential in
various scenarios. However, the literature often relies only on one or two
standard scores to compare classifiers, which fails to capture the nuances of
application-specific requirements, potentially leading to suboptimal classifier
selection. Recently, a paper on the foundations of the theory of
performance-based ranking introduced a tool, called the Tile, that organizes an
infinity of ranking scores into a 2D map. Thanks to the Tile, it is now
possible to evaluate and compare classifiers efficiently, displaying all
possible application-specific preferences instead of having to rely on a pair
of scores. In this paper, we provide a first hitchhiker's guide for
understanding the performances of two-class classifiers by presenting four
scenarios, each showcasing a different user profile: a theoretical analyst, a
method designer, a benchmarker, and an application developer. Particularly, we
show that we can provide different interpretative flavors that are adapted to
the user's needs by mapping different values on the Tile. As an illustration,
we leverage the newly introduced Tile tool and the different flavors to rank
and analyze the performances of 74 state-of-the-art semantic segmentation
models in two-class classification through the eyes of the four user profiles.
Through these user profiles, we demonstrate that the Tile effectively captures
the behavior of classifiers in a single visualization, while accommodating an
infinite number of ranking scores.
♻ ☆ The Tile: A 2D Map of Ranking Scores for Two-Class Classification
In the computer vision and machine learning communities, as well as in many
other research domains, rigorous evaluation of any new method, including
classifiers, is essential. One key component of the evaluation process is the
ability to compare and rank methods. However, ranking classifiers and
accurately comparing their performances, especially when taking
application-specific preferences into account, remains challenging. For
instance, commonly used evaluation tools like Receiver Operating Characteristic
(ROC) and Precision/Recall (PR) spaces display performances based on two
scores. Hence, they are inherently limited in their ability to compare
classifiers across a broader range of scores and lack the capability to
establish a clear ranking among classifiers. In this paper, we present a novel
versatile tool, named the Tile, that organizes an infinity of ranking scores in
a single 2D map for two-class classifiers, including common evaluation scores
such as the accuracy, the true positive rate, the positive predictive value,
Jaccard's coefficient, and all F-beta scores. Furthermore, we study the
properties of the underlying ranking scores, such as the influence of the
priors or the correspondences with the ROC space, and depict how to
characterize any other score by comparing them to the Tile. Overall, we
demonstrate that the Tile is a powerful tool that effectively captures all the
rankings in a single visualization and allows interpreting them.
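The following toy example is not the paper's actual Tile parameterization; it only illustrates, with made-up confusion counts, how sweeping a family of scores (here F-beta, which the abstract lists among the scores the Tile organizes) can change which of two classifiers is ranked first.

```python
# Illustrative only: two hypothetical classifiers and the F-beta family of scores.
def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

clf_a = dict(tp=80, fp=30, fn=20)   # higher recall
clf_b = dict(tp=70, fp=10, fn=30)   # higher precision
for beta in (0.5, 1.0, 2.0):
    a, b = f_beta(beta=beta, **clf_a), f_beta(beta=beta, **clf_b)
    print(f"beta={beta}: A={a:.3f}  B={b:.3f}  winner={'A' if a > b else 'B'}")
# In this toy example B wins for beta <= 1 (precision-leaning) while A wins for beta = 2.
```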
♻ ☆ Foundations of the Theory of Performance-Based Ranking
Ranking entities such as algorithms, devices, methods, or models based on
their performances, while accounting for application-specific preferences, is a
challenge. To address this challenge, we establish the foundations of a
universal theory for performance-based ranking. First, we introduce a rigorous
framework built on top of both the probability and order theories. Our new
framework encompasses the elements necessary to (1) manipulate performances as
mathematical objects, (2) express which performances are worse than or
equivalent to others, (3) model tasks through a variable called satisfaction,
(4) consider properties of the evaluation, (5) define scores, and (6) specify
application-specific preferences through a variable called importance. On top
of this framework, we propose the first axiomatic definition of performance
orderings and performance-based rankings. Then, we introduce a universal
parametric family of scores, called ranking scores, that can be used to
establish rankings satisfying our axioms, while considering
application-specific preferences. Finally, we show, in the case of two-class
classification, that the family of ranking scores encompasses well-known
performance scores, including the accuracy, the true positive rate (recall,
sensitivity), the true negative rate (specificity), the positive predictive
value (precision), and F1. However, we also show that some other scores
commonly used to compare classifiers are unsuitable to derive performance
orderings satisfying the axioms. Therefore, this paper provides the computer
vision and machine learning communities with a rigorous framework for
evaluating and ranking entities.
♻ ☆ Photoacoustic Iterative Optimization Algorithm with Shape Prior Regularization
Photoacoustic imaging (PAI) suffers from inherent limitations that can
degrade the quality of reconstructed results, such as noise, artifacts and
incomplete data acquisition caused by sparse sampling or partial array
detection. In this study, we proposed a new optimization method for both
two-dimensional (2D) and three-dimensional (3D) PAI reconstruction results,
called the regularized iteration method with shape prior. The shape prior is a
probability matrix derived from the reconstruction results of multiple sets of
random partial array signals in a computational imaging system using any
reconstruction algorithm, such as Delay-and-Sum (DAS) and Back-Projection (BP).
In the probability matrix, high-probability locations indicate high consistency
among multiple reconstruction results at those positions, suggesting a high
likelihood of representing the true imaging results. In contrast,
low-probability locations indicate higher randomness, leaning more towards
noise or artifacts. As a shape prior, this probability matrix guides the
iteration and regularization of the entire array signal reconstruction results
using the original reconstruction algorithm (the same algorithm for processing
random partial array signals). The method takes advantage of the property that
the similarity of the object to be imaged is higher than that of noise or
artifacts in the results reconstructed by multiple sets of random partial array
signals of the entire imaging system. The probability matrix is taken as a
prerequisite for improving the original reconstruction results, and the
optimizer is used to further iterate the imaging results to remove noise and
artifacts and improve the imaging fidelity. The improvement is especially
remarkable in sparse-view cases, which introduce more artifacts. Simulation
and real experiments have both demonstrated the superiority of this method.
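A rough NumPy sketch of the shape-prior construction described above: several random partial-array subsets are reconstructed with the user's own algorithm (the `reconstruct` callable below is a placeholder for, e.g., DAS or BP), each result is binarized, and the binary maps are averaged into a per-pixel consistency probability matrix. Thresholds and subset sizes are illustrative assumptions.

```python
import numpy as np

def shape_prior(signals, reconstruct, n_subsets=10, keep_ratio=0.5, thresh=0.2, rng=None):
    """signals: (n_sensors, n_samples) array; returns an (H, W) probability matrix."""
    rng = np.random.default_rng(rng)
    n_sensors = signals.shape[0]
    maps = []
    for _ in range(n_subsets):
        idx = rng.choice(n_sensors, size=int(keep_ratio * n_sensors), replace=False)
        img = reconstruct(signals[idx], idx)           # partial-array reconstruction (user-supplied)
        img = np.abs(img) / (np.abs(img).max() + 1e-8)
        maps.append(img > thresh)                      # binary "structure present" map
    return np.mean(maps, axis=0)                       # high value = consistent across subsets
```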
♻ ☆ A2RNet: Adversarial Attack Resilient Network for Robust Infrared and Visible Image Fusion
Jiawei Li, Hongwei Yu, Jiansheng Chen, Xinlong Ding, Jinlong Wang, Jinyuan Liu, Bochao Zou, Huimin Ma
Infrared and visible image fusion (IVIF) is a crucial technique for enhancing
visual performance by integrating unique information from different modalities
into one fused image. Existing methods pay more attention to conducting fusion
with undisturbed data, while overlooking the impact of deliberate interference
on the effectiveness of fusion results. To investigate the robustness of fusion
models, in this paper, we propose a novel adversarial attack resilient network,
called $\textrm{A}^{\textrm{2}}$RNet. Specifically, we develop an adversarial
paradigm with an anti-attack loss function to implement adversarial attacks and
training. It is constructed based on the intrinsic nature of IVIF and provides a
robust foundation for future research advancements. We adopt a UNet as the
pipeline with a transformer-based defensive refinement module (DRM) under this
paradigm, which guarantees fused image quality in a robust coarse-to-fine
manner. Compared to previous works, our method mitigates the adverse effects of
adversarial perturbations, consistently maintaining high-fidelity fusion
results. Furthermore, the performance of downstream tasks can also be well
maintained under adversarial attacks. Code is available at
https://github.com/lok-18/A2RNet.
comment: 9 pages, 8 figures, The 39th Annual AAAI Conference on Artificial
Intelligence
♻ ☆ VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization
We present VQTalker, a Vector Quantization-based framework for multilingual
talking head generation that addresses the challenges of lip synchronization
and natural motion across diverse languages. Our approach is grounded in the
phonetic principle that human speech comprises a finite set of distinct sound
units (phonemes) and corresponding visual articulations (visemes), which often
share commonalities across languages. We introduce a facial motion tokenizer
based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a
discretized representation of facial features. This method enables
comprehensive capture of facial movements while improving generalization to
multiple languages, even with limited training data. Building on this quantized
representation, we implement a coarse-to-fine motion generation process that
progressively refines facial animations. Extensive experiments demonstrate that
VQTalker achieves state-of-the-art performance in both video-driven and
speech-driven scenarios, particularly in multilingual settings. Notably, our
method achieves high-quality results at a resolution of 512*512 pixels while
maintaining a lower bitrate of approximately 11 kbps. Our work opens new
possibilities for cross-lingual talking face generation. Synthetic results can
be viewed at https://x-lance.github.io/VQTalker.
comment: 14 pages
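As background for the quantizer named above, the sketch below shows plain finite scalar quantization (FSQ), the kind of discretization GRFSQ builds on; the grouped and residual stages used by VQTalker are omitted, so treat this as a simplified assumption-laden illustration rather than the paper's tokenizer.

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Quantize each channel of z to `levels` evenly spaced values in [-1, 1]."""
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half              # squash to [-half, half]
    quantized = torch.round(bounded)            # snap to the integer grid
    # straight-through estimator: forward uses the quantized values, backward uses `bounded`
    return (bounded + (quantized - bounded).detach()) / half

codes = fsq(torch.randn(2, 16))                 # values in {-1, -0.5, 0, 0.5, 1} for levels=5
```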
♻ ☆ HaSPeR: An Image Repository for Hand Shadow Puppet Recognition
Hand shadow puppetry, also known as shadowgraphy or ombromanie, is a form of
theatrical art and storytelling where hand shadows are projected onto flat
surfaces to create illusions of living creatures. The skilled performers create
these silhouettes by hand positioning, finger movements, and dexterous gestures
to resemble shadows of animals and objects. Due to the lack of practitioners
and a seismic shift in people's entertainment standards, this art form is on
the verge of extinction. To facilitate its preservation and proliferate it to a
wider audience, we introduce HaSPeR, a novel dataset
consisting of 15,000 images of hand shadow puppets across 15 classes extracted
from both professional and amateur hand shadow puppeteer clips. We provide a
detailed statistical analysis of the dataset and employ a range of pretrained
image classification models to establish baselines. Our findings show a
substantial performance superiority of skip-connected convolutional models over
attention-based transformer architectures. We also find that lightweight
models, such as MobileNetV2, suited for mobile applications and embedded
devices, perform comparatively well. We surmise that such low-latency
architectures can be useful in developing ombromanie teaching tools, and we
create a prototype application to explore this surmise. Keeping the
best-performing model, ResNet34, in the limelight, we conduct comprehensive
feature-spatial, explainability, and error analyses to gain insights into its
decision-making process. To the best of our knowledge, this is the first
documented dataset and research endeavor to preserve this dying art for future
generations, with computer vision approaches. Our code and data will be
publicly available.
comment: Submitted to IEEE Transactions on Artificial Intelligence (IEEE TAI),
13 pages, 105 figures, 2 tables
♻ ☆ Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation
In this paper, we propose an efficient, fast, and versatile distillation
method to accelerate the generation of pre-trained diffusion models: Flash
Diffusion. The method reaches state-of-the-art performances in terms of FID and
CLIP-Score for few steps image generation on the COCO2014 and COCO2017
datasets, while requiring only several GPU hours of training and fewer
trainable parameters than existing methods. In addition to its efficiency, the
versatility of the method is also exposed across several tasks such as
text-to-image, inpainting, face-swapping, super-resolution and using different
backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\alpha$),
as well as adapters. In all cases, the method drastically reduces the number
of sampling steps while maintaining very high-quality image generation.
The official implementation is available at
https://github.com/gojasper/flash-diffusion.
comment: Accepted to AAAI 2025
♻ ☆ Denoising Diffusion Probabilistic Models for Magnetic Resonance Fingerprinting
Magnetic Resonance Fingerprinting (MRF) is a time-efficient approach to
quantitative MRI, enabling the mapping of multiple tissue properties from a
single, accelerated scan. However, achieving accurate reconstructions remains
challenging, particularly in highly accelerated and undersampled acquisitions,
which are crucial for reducing scan times. While deep learning techniques have
advanced image reconstruction, the recent introduction of diffusion models
offers new possibilities for imaging tasks, though their application in the
medical field is still emerging. Notably, diffusion models have not yet been
explored for the MRF problem. In this work, we propose for the first time a
conditional diffusion probabilistic model for MRF image reconstruction.
Qualitative and quantitative comparisons on in-vivo brain scan data demonstrate
that the proposed approach can outperform established deep learning and
compressed sensing algorithms for MRF reconstruction. Extensive ablation
studies also explore strategies to improve computational efficiency of our
approach.
comment: 13 pages, 5 figures, 3 tables, 2 algorithms
♻ ☆ Understanding Key Point Cloud Features for Development Three-dimensional Adversarial Attacks
Adversarial attacks pose serious challenges for deep neural network
(DNN)-based analysis of various input signals. In the case of three-dimensional
point clouds, methods have been developed to identify points that play a key
role in network decision, and these become crucial in generating existing
adversarial attacks. For example, a saliency map approach is a popular method
for identifying adversarial drop points, whose removal would significantly
impact the network decision. This paper seeks to enhance the understanding of
three-dimensional adversarial attacks by exploring which point cloud features
are most important for predicting adversarial points. Specifically, fourteen
key point cloud features such as edge intensity and distance from the centroid
are defined, and multiple linear regression is employed to assess their
predictive power for adversarial points. Based on critical feature selection
insights, a new attack method has been developed to evaluate whether the
selected features can generate an attack successfully. Unlike traditional
attack methods that rely on model-specific vulnerabilities, this approach
focuses on the intrinsic characteristics of the point clouds themselves. It is
demonstrated that these features can predict adversarial points across four
different DNN architectures: Point Network (PointNet), PointNet++, Dynamic
Graph Convolutional Neural Networks (DGCNN), and Point Convolutional Network
(PointConv), outperforming random guessing and achieving results comparable to
saliency map-based attacks. This study has important engineering applications,
such as enhancing the security and robustness of three-dimensional point
cloud-based systems in fields like robotics and autonomous driving.
comment: 10 pages, 6 figures
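A hedged sketch of the kind of analysis described above: fitting a multiple linear regression from hand-crafted per-point features to a per-point adversarial saliency score. The two features below (distance to centroid and a k-nearest-neighbour proxy for local edge intensity) are illustrative stand-ins, not the paper's full set of fourteen.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_feature_regression(points: np.ndarray, saliency: np.ndarray):
    """points: (N, 3) cloud; saliency: (N,) target scores, e.g. from a saliency map."""
    centroid = points.mean(axis=0)
    dist_centroid = np.linalg.norm(points - centroid, axis=1)          # feature 1
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    knn = np.sort(d, axis=1)[:, 1:9].mean(axis=1)                      # feature 2: mean 8-NN distance
    X = np.stack([dist_centroid, knn], axis=1)
    model = LinearRegression().fit(X, saliency)
    return model, model.score(X, saliency)                             # fitted model and R^2
```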
♻ ☆ PVP: Polar Representation Boost for 3D Semantic Occupancy Prediction
Recently, polar coordinate-based representations have shown promise for 3D
perceptual tasks. Compared to Cartesian methods, polar grids provide a viable
alternative, offering better detail preservation in nearby spaces while
covering larger areas. However, they face feature distortion due to non-uniform
division. To address these issues, we introduce the Polar Voxel Occupancy
Predictor (PVP), a novel 3D multi-modal predictor that operates in polar
coordinates. PVP features two key design elements to overcome distortion: a
Global Represent Propagation (GRP) module that integrates global spatial data
into 3D volumes, and a Plane Decomposed Convolution (PD-Conv) that simplifies
3D distortions into 2D convolutions. These innovations enable PVP to outperform
existing methods, achieving significant improvements in mIoU and IoU metrics on
the OpenOccupancy dataset.
♻ ☆ Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration
In the field of audio-visual learning, most research tasks focus exclusively
on short videos. This paper focuses on the more practical Dense Audio-Visual
Event Localization (DAVEL) task, advancing audio-visual scene understanding for
longer, untrimmed videos. This task seeks to identify and temporally pinpoint
all events simultaneously occurring in both audio and visual streams.
Typically, each video encompasses dense events of multiple classes, which may
overlap on the timeline, each exhibiting varied durations. Given these
challenges, effectively exploiting the audio-visual relations and the temporal
features encoded at various granularities becomes crucial. To address these
challenges, we introduce a novel CCNet, comprising two core modules: the
Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity
Collaboration (MTGC). Specifically, the CMCC module contains two branches: a
cross-modal interaction branch and a temporal consistency-gated branch. The
former branch facilitates the aggregation of consistent event semantics across
modalities through the encoding of audio-visual relations, while the latter
branch guides one modality's focus to pivotal event-relevant temporal areas as
discerned in the other modality. The MTGC module includes a coarse-to-fine
collaboration block and a fine-to-coarse collaboration block, providing
bidirectional support among coarse- and fine-grained temporal features.
Extensive experiments on the UnAV-100 dataset validate our module design,
resulting in a new state-of-the-art performance in dense audio-visual event
localization. The code is available at
https://github.com/zzhhfut/CCNet-AAAI2025.
comment: Accepted by AAAI 2025. Project page:
https://github.com/zzhhfut/CCNet-AAAI2025. Jinxing Zhou and Dan Guo are the
corresponding authors
♻ ☆ VE-Bench: Subjective-Aligned Benchmark Suite for Text-Driven Video Editing Quality Assessment
Text-driven video editing has recently experienced rapid development. Despite
this, evaluating edited videos remains a considerable challenge. Current
metrics tend to fail to align with human perceptions, and effective
quantitative metrics for video editing are still notably absent. To address
this, we introduce VE-Bench, a benchmark suite tailored to the assessment of
text-driven video editing. This suite includes VE-Bench DB, a video quality
assessment (VQA) database for video editing. VE-Bench DB encompasses a diverse
set of source videos featuring various motions and subjects, along with
multiple distinct editing prompts, editing results from 8 different models, and
the corresponding Mean Opinion Scores (MOS) from 24 human annotators. Based on
VE-Bench DB, we further propose VE-Bench QA, a quantitative human-aligned
measurement for the text-driven video editing task. In addition to the
aesthetic, distortion, and other visual quality indicators that traditional VQA
methods emphasize, VE-Bench QA focuses on the text-video alignment and the
relevance modeling between source and edited videos. It proposes a new
assessment network for video editing that attains superior performance in
alignment with human preferences. To the best of our knowledge, VE-Bench
introduces the first quality assessment dataset for video editing and an
effective subjective-aligned quantitative metric for this domain. All data and
code will be publicly available at https://github.com/littlespray/VE-Bench.
comment: Accepted to AAAI 2025
♻ ☆ Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation
Personalized text-to-image generation methods can generate customized images
based on the reference images, which have garnered wide research interest.
Recent methods propose a finetuning-free approach with a decoupled
cross-attention mechanism to generate personalized images requiring no
test-time finetuning. However, when multiple reference images are provided, the
current decoupled cross-attention mechanism encounters the object confusion
problem and fails to map each reference image to its corresponding object,
thereby seriously limiting its scope of application. To address the object
confusion problem, in this work we investigate the relevance of different
positions of the latent image features to the target object in diffusion model,
and accordingly propose a weighted-merge method to merge multiple reference
image features into the corresponding objects. Next, we integrate this
weighted-merge method into existing pre-trained models and continue to train
the model on a multi-object dataset constructed from the open-sourced SA-1B
dataset. To mitigate object confusion and reduce training costs, we propose an
object quality score to estimate the image quality for the selection of
high-quality training samples. Furthermore, our weighted-merge training
framework can be employed on single-object generation when a single object has
multiple reference images. The experiments verify that our method achieves
superior performance to the state-of-the-arts on the Concept101 dataset and
DreamBooth dataset of multi-object personalized image generation, and
remarkably improves the performance on single-object personalized image
generation. Our code is available at https://github.com/hqhQAQ/MIP-Adapter.
♻ ☆ DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
Video large language models (VLLMs) have significantly advanced recently in
processing complex video content, yet their inference efficiency remains
constrained because of the high computational cost stemming from the thousands
of visual tokens generated from the video inputs. We empirically observe that,
unlike single image inputs, VLLMs typically attend to visual tokens from different
frames at different decoding iterations, making a one-shot pruning strategy
prone to removing important tokens by mistake. Motivated by this, we present
DyCoke, a training-free token compression method to optimize token
representation and accelerate VLLMs. DyCoke incorporates a plug-and-play
temporal compression module to minimize temporal redundancy by merging
redundant tokens across frames, and applies dynamic KV cache reduction to prune
spatially redundant tokens selectively. It ensures high-quality inference by
dynamically retaining the critical tokens at each decoding step. Extensive
experimental results demonstrate that DyCoke can outperform the prior SoTA
counterparts, achieving a 1.5X inference speedup and a 1.4X memory reduction
against the baseline VLLM, while still improving performance, with no training.
comment: 12 pages, 6 figures
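An illustrative temporal-merging step in the spirit of the plug-and-play module described above (not the authors' exact algorithm): a token in frame t is dropped when it is nearly identical to the corresponding token in frame t-1, so only tokens that changed are kept for later frames.

```python
import torch
import torch.nn.functional as F

def merge_temporal_tokens(tokens: torch.Tensor, sim_thresh: float = 0.9):
    """tokens: (T, N, D) visual tokens per frame. Returns a list of kept tokens per frame."""
    kept = [tokens[0]]                                               # keep the first frame in full
    for t in range(1, tokens.shape[0]):
        sim = F.cosine_similarity(tokens[t], tokens[t - 1], dim=-1)  # (N,) per-token similarity
        kept.append(tokens[t][sim < sim_thresh])                     # keep only tokens that changed
    return kept
```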
♻ ☆ Lifting Scheme-Based Implicit Disentanglement of Emotion-Related Facial Dynamics in the Wild
In-the-wild dynamic facial expression recognition (DFER) encounters a
significant challenge in recognizing emotion-related expressions, which are
often temporally and spatially diluted by emotion-irrelevant expressions and
global context. Most prior DFER methods directly utilize coupled spatiotemporal
representations that may incorporate weakly relevant features with
emotion-irrelevant context bias. Several DFER methods highlight dynamic
information for DFER, but follow explicit guidance that may be vulnerable to
irrelevant motion. In this paper, we propose a novel Implicit Facial Dynamics
Disentanglement framework (IFDD). Through expanding wavelet lifting scheme to
fully learnable framework, IFDD disentangles emotion-related dynamic
information from emotion-irrelevant global context in an implicit manner, i.e.,
without explicit operations or external guidance. The disentanglement process
contains two stages. The first is Inter-frame Static-dynamic Splitting Module
(ISSM) for rough disentanglement estimation, which explores inter-frame
correlation to generate content-aware splitting indexes on-the-fly. We utilize
these indexes to split frame features into two groups, one with greater global
similarity, and the other with more unique dynamic features. The second stage
is Lifting-based Aggregation-Disentanglement Module (LADM) for further
refinement. LADM first aggregates two groups of features from ISSM to obtain
fine-grained global context features by an updater, and then disentangles
emotion-related facial dynamic features from the global context by a predictor.
Extensive experiments on in-the-wild datasets have demonstrated that IFDD
outperforms prior supervised DFER methods with higher recognition accuracy and
comparable efficiency. Code is available at
https://github.com/CyberPegasus/IFDD.
comment: 14 pages, 5 figures
♻ ☆ Demystify Transformers & Convolutions in Modern Image Deep Networks
Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai
Vision transformers have gained popularity recently, leading to the
development of new vision backbones with improved features and consistent
performance gains. However, these advancements are not solely attributable to
novel feature transformation designs; certain benefits also arise from advanced
network-level and block-level architectures. This paper aims to identify the
real gains of popular convolution and attention operators through a detailed
study. We find that the key difference among these feature transformation
modules, such as attention or convolution, lies in their spatial feature
aggregation approach, known as the "spatial token mixer" (STM). To facilitate
an impartial comparison, we introduce a unified architecture to neutralize the
impact of divergent network-level and block-level designs. Subsequently,
various STMs are integrated into this unified framework for comprehensive
comparative analysis. Our experiments on various tasks and an analysis of
inductive bias show a significant performance boost due to advanced
network-level and block-level designs, but performance differences persist
among different STMs. Our detailed analysis also reveals various findings about
different STMs, including effective receptive fields, invariance, and
adversarial robustness tests.
comment: This paper was accepted to IEEE Transactions on Pattern Analysis and
Machine Intelligence (IEEE TPAMI). All models and codes used in this study
are publicly available at https://github.com/OpenGVLab/STM-Evaluation
♻ ☆ GN-FR:Generalizable Neural Radiance Fields for Flare Removal
Flare, an optical phenomenon resulting from unwanted scattering and
reflections within a lens system, presents a significant challenge in imaging.
The diverse patterns of flares, such as halos, streaks, color bleeding, and
haze, complicate the flare removal process. Existing traditional and
learning-based methods have exhibited limited efficacy due to their reliance on
single-image approaches, where flare removal is highly ill-posed. We address
this by framing flare removal as a multi-view image problem, taking advantage
of the view-dependent nature of flare artifacts. This approach leverages
information from neighboring views to recover details obscured by flare in
individual images. Our proposed framework, GN-FR (Generalizable Neural Radiance
Fields for Flare Removal), can render flare-free views from a sparse set of
input images affected by lens flare and generalizes across different scenes in
an unsupervised manner. GN-FR incorporates several modules within the
Generalizable NeRF Transformer (GNT) framework: Flare-occupancy Mask Generation
(FMG), View Sampler (VS), and Point Sampler (PS). To overcome the
impracticality of capturing both flare-corrupted and flare-free data, we
introduce a masking loss function that utilizes mask information in an
unsupervised setting. Additionally, we present a 3D multi-view flare dataset,
comprising 17 real flare scenes with 782 images, 80 real flare patterns, and
their corresponding annotated flare-occupancy masks. To our knowledge, this is
the first work to address flare removal within a Neural Radiance Fields (NeRF)
framework.
comment: Accepted for publication at BMVC-24
♻ ☆ Semantics-Aware Next-best-view Planning for Efficient Search and Detection of Task-relevant Plant Parts
Akshay K. Burusa, Joost Scholten, David Rapado Rincon, Xin Wang, Eldert J. van Henten, Gert Kootstra
Searching and detecting the task-relevant parts of plants is important to
automate harvesting and de-leafing of tomato plants using robots. This is
challenging due to high levels of occlusion in tomato plants. Active vision is
a promising approach in which the robot strategically plans its camera
viewpoints to overcome occlusion and improve perception accuracy. However,
current active-vision algorithms cannot differentiate between relevant and
irrelevant plant parts and spend time on perceiving irrelevant plant parts.
This work proposed a semantics-aware active-vision strategy that uses semantic
information to identify the relevant plant parts and prioritise them during
view planning. The proposed strategy was evaluated on the task of searching and
detecting the relevant plant parts using simulation and real-world experiments.
In simulation experiments, the semantics-aware strategy proposed could search
and detect 81.8% of the relevant plant parts using nine viewpoints. It was
significantly faster and detected more plant parts than predefined, random, and
volumetric active-vision strategies that do not use semantic information. The
strategy proposed was also robust to uncertainty in plant and plant-part
positions, plant complexity, and different viewpoint-sampling strategies. In
real-world experiments, the semantics-aware strategy could search and detect
82.7% of the relevant plant parts using seven viewpoints, under complex
greenhouse conditions with natural variation and occlusion, natural
illumination, sensor noise, and uncertainty in camera poses. The results of
this work clearly indicate the advantage of using semantics-aware active vision
for targeted perception of plant parts and its applicability in the real world.
It can significantly improve the efficiency of automated harvesting and
de-leafing in tomato crop production.
♻ ☆ MSP-MVS: Multi-Granularity Segmentation Prior Guided Multi-View Stereo
Recently, patch deformation-based methods have demonstrated significant
strength in multi-view stereo by adaptively expanding the receptive field of
patches to help reconstruct textureless areas. However, such methods mainly
concentrate on searching for pixels without matching ambiguity (i.e., reliable
pixels) when constructing deformed patches, while neglecting the deformation
instability caused by unexpected edge-skipping, resulting in potential matching
distortions. Addressing this, we propose MSP-MVS, a method introducing
multi-granularity segmentation prior for edge-confined patch deformation.
Specifically, to avoid unexpected edge-skipping, we first aggregate and further
refine multi-granularity depth edges gained from Semantic-SAM as prior to guide
patch deformation within depth-continuous (i.e., homogeneous) areas. Moreover,
to address attention imbalance caused by edge-confined patch deformation, we
implement adaptive equidistribution and disassemble-clustering of correlative
reliable pixels (i.e., anchors), thereby promoting attention-consistent patch
deformation. Finally, to prevent deformed patches from falling into
local-minimum matching costs caused by the fixed sampling pattern, we introduce
disparity-sampling synergistic 3D optimization to help identify global-minimum
matching costs. Evaluations on ETH3D and Tanks & Temples benchmarks prove our
method obtains state-of-the-art performance with remarkable generalization.
♻ ☆ SARATR-X: Towards Building A Foundation Model for SAR Target Recognition
Despite the remarkable progress in synthetic aperture radar automatic target
recognition (SAR ATR), recent efforts have concentrated on detecting and
classifying a specific category, e.g., vehicles, ships, airplanes, or
buildings. One of the fundamental limitations of the top-performing SAR ATR
methods is that the learning paradigm is supervised, task-specific,
limited-category, closed-world learning, which depends on massive amounts of
accurately annotated samples that are expensively labeled by expert SAR
analysts and have limited generalization capability and scalability. In this
work, we make the first attempt towards building a foundation model for SAR
ATR, termed SARATR-X. SARATR-X learns generalizable representations via
self-supervised learning (SSL) and provides a cornerstone for label-efficient
model adaptation to generic SAR target detection and classification tasks.
Specifically, SARATR-X is trained on 0.18 M unlabelled SAR target samples,
which are curated by combining contemporary benchmarks and constitute the
largest publicly available dataset till now. Considering the characteristics of
SAR images, a backbone tailored for SAR ATR is carefully designed, and a
two-step SSL method endowed with multi-scale gradient features is applied to
ensure the feature diversity and model scalability of SARATR-X. The
capabilities of SARATR-X are evaluated on classification under few-shot and
robustness settings and detection across various categories and scenes, and
impressive performance is achieved, often competitive with or even superior to
prior fully supervised, semi-supervised, or self-supervised algorithms. Our
SARATR-X and the curated dataset are released at
https://github.com/waterdisappear/SARATR-X to foster research into foundation
models for SAR image interpretation.
comment: 20 pages, 9 figures
♻ ☆ Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models
Julius Pesonen, Teemu Hakala, Väinö Karjalainen, Niko Koivumäki, Lauri Markelin, Anna-Maria Raita-Hakola, Juha Suomalainen, Ilkka Pölönen, Eija Honkavaara
Early detection of wildfires is essential to prevent large-scale fires
resulting in extensive environmental, structural, and societal damage. Uncrewed
aerial vehicles (UAVs) can cover large remote areas effectively with quick
deployment requiring minimal infrastructure and equipping them with small
cameras and computers enables autonomous real-time detection. In remote areas,
however, detection methods are limited to onboard computation due to the lack
of high-bandwidth mobile networks. For accurate camera-based localisation,
segmentation of the detected smoke is essential but training data for deep
learning-based wildfire smoke segmentation is limited. This study shows how
small specialised segmentation models can be trained using only bounding box
labels, leveraging zero-shot foundation model supervision. The method offers
the advantages of needing only fairly easily obtainable bounding box labels and
requiring training solely for the smaller student network. The proposed method
achieved 63.3% mIoU on a manually annotated and diverse wildfire dataset. The
used model can perform in real-time at ~25 fps with a UAV-carried NVIDIA Jetson
Orin NX computer while reliably recognising smoke, as demonstrated at
real-world forest burning events. Code is available at:
https://gitlab.com/fgi_nls/public/wildfire-real-time-segmentation
♻ ☆ QCS:Feature Refining from Quadruplet Cross Similarity for Facial Expression Recognition
Facial expression recognition faces challenges where labeled significant
features in datasets are mixed with unlabeled redundant ones. In this paper, we
introduce Cross Similarity Attention (CSA) to mine richer intrinsic information
from image pairs, overcoming a limitation when the Scaled Dot-Product Attention
of ViT is directly applied to calculate the similarity between two different
images. Based on CSA, we simultaneously minimize intra-class differences and
maximize inter-class differences at the fine-grained feature level through
interactions among multiple branches. Contrastive residual distillation is
utilized to transfer the information learned in the cross module back to the
base network. We ingeniously design a four-branch centrally symmetric network,
named Quadruplet Cross Similarity (QCS), which alleviates gradient conflicts
arising from the cross module and achieves balanced and stable training. It can
adaptively extract discriminative features while isolating redundant ones. The
cross-attention modules exist during training, and only one base branch is
retained during inference, resulting in no increase in inference time. Our
proposed method achieves state-of-the-art performance on several FER datasets.
♻ ☆ Efficient Transfer Learning for Video-language Foundation Models
Pre-trained vision-language models provide a robust foundation for efficient
transfer learning across various downstream tasks. In the field of video action
recognition, mainstream approaches often introduce additional parameter modules
to capture temporal information. While the increased model capacity brought by
these additional parameters helps better fit the video-specific inductive
biases, existing methods require learning a large number of parameters and are
prone to catastrophic forgetting of the original generalizable knowledge. In
this paper, we propose a simple yet effective Multi-modal Spatio-Temporal
Adapter (MSTA) to improve the alignment between representations in the text and
vision branches, achieving a balance between general knowledge and
task-specific knowledge. Furthermore, to mitigate over-fitting and enhance
generalizability, we introduce a spatio-temporal description-guided consistency
constraint. This constraint involves feeding template inputs (i.e., ``a video
of $\{\textbf{cls}\}$'') into the trainable language branch, while
LLM-generated spatio-temporal descriptions are input into the pre-trained
language branch, enforcing consistency between the outputs of the two branches.
This mechanism prevents over-fitting to downstream tasks and improves the
distinguishability of the trainable branch within the spatio-temporal semantic
space. We evaluate the effectiveness of our approach across four tasks:
zero-shot transfer, few-shot learning, base-to-novel generalization, and
fully-supervised learning. Compared to many state-of-the-art methods, our MSTA
achieves outstanding performance across all evaluations, while using only 2-7\%
of the trainable parameters in the original model. Code will be available at
https://github.com/chenhaoxing/ETL4Video.
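A minimal sketch of a description-guided consistency constraint of the kind described above: the trainable text branch encodes a template prompt while a frozen copy encodes an LLM-generated spatio-temporal description, and the two embeddings are pulled together. The encoder callables and token inputs are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(trainable_text_encoder, frozen_text_encoder,
                     template_tokens, description_tokens):
    z_train = trainable_text_encoder(template_tokens)           # e.g. "a video of {cls}"
    with torch.no_grad():
        z_frozen = frozen_text_encoder(description_tokens)      # LLM-generated description
    z_train = F.normalize(z_train, dim=-1)
    z_frozen = F.normalize(z_frozen, dim=-1)
    return (1 - (z_train * z_frozen).sum(dim=-1)).mean()        # cosine-distance consistency
```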
♻ ☆ REVECA: Adaptive Planning and Trajectory-based Validation in Cooperative Language Agents using Information Relevance and Relative Proximity
We address the challenge of multi-agent cooperation, where agents achieve a
common goal by cooperating with decentralized agents under complex partial
observations. Existing cooperative agent systems often struggle with
efficiently processing continuously accumulating information, managing globally
suboptimal planning due to lack of consideration of collaborators, and
addressing false planning caused by environmental changes introduced by other
collaborators. To overcome these challenges, we propose the RElevance,
Proximity, and Validation-Enhanced Cooperative Language Agent (REVECA), a novel
cognitive architecture powered by GPT-4o-mini. REVECA enables efficient memory
management, optimal planning, and cost-effective prevention of false planning
by leveraging Relevance Estimation, Adaptive Planning, and Trajectory-based
Validation. Extensive experimental results demonstrate REVECA's superiority
over existing methods across various benchmarks, while a user study reveals its
potential for achieving trustworthy human-AI cooperation.
comment: v2 is the AAAI'25 camera-ready version, including the appendix, which
has been enhanced based on the reviewers' comments
♻ ☆ Idea23D: Collaborative LMM Agents Enable 3D Model Generation from Interleaved Multimodal Inputs
With the success of 2D diffusion models, 2D AIGC content has already
transformed our lives. Recently, this success has been extended to 3D AIGC,
with state-of-the-art methods generating textured 3D models from single images
or text. However, we argue that current 3D AIGC methods still do not fully
unleash human creativity. We often imagine 3D content made from multimodal
inputs, such as what it would look like if my pet bunny were eating a doughnut
on the table. In this paper, we explore a novel 3D AIGC approach: generating 3D
content from IDEAs. An IDEA is a multimodal input composed of text, image, and
3D models. To our knowledge, this challenging and exciting 3D AIGC setting has
not been studied before. We propose the new framework Idea23D, which combines
three agents based on large multimodal models (LMMs) and existing algorithmic
tools. These three LMM-based agents are tasked with prompt generation, model
selection, and feedback reflection. They collaborate and critique each other in
a fully automated loop, without human intervention. The framework then
generates a text prompt to create 3D models that align closely with the input
IDEAs. We demonstrate impressive 3D AIGC results that surpass previous methods.
To comprehensively assess the 3D AIGC capabilities of Idea23D, we introduce the
Eval3DAIGC-198 dataset, containing 198 multimodal inputs for 3D generation
tasks. This dataset evaluates the alignment between generated 3D content and
input IDEAs. Our user study and quantitative results show that Idea23D
significantly improves the success rate and accuracy of 3D generation, with
excellent compatibility across various LMM, Text-to-Image, and Image-to-3D
models. Code and dataset are available at https://idea23d.github.io/.
comment: Accepted by COLING 2025 (The 31st International Conference on
Computational Linguistics) Project Page: https://idea23d.github.io/ Code:
https://github.com/yisuanwang/Idea23D
♻ ☆ Diffusion Model from Scratch
Diffusion generative models are currently the most popular generative models.
However, their underlying modeling process is quite complex, and starting
directly with the seminal paper Denoising Diffusion Probability Model (DDPM)
can be challenging. This paper aims to assist readers in building a
foundational understanding of generative models by tracing the evolution from
VAEs to DDPM through detailed mathematical derivations and a problem-oriented
analytical approach. It also explores the core ideas and improvement strategies
of current mainstream methodologies, providing guidance for undergraduate and
graduate students interested in learning about diffusion models.
comment: There were problems with the typography of our illustrations, and
there were problems with the derivation of the 200-step formula
♻ ☆ Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval
Yuanmin Tang, Xiaoting Qin, Jue Zhang, Jing Yu, Gaopeng Gou, Gang Xiong, Qingwei Ling, Saravan Rajmohan, Dongmei Zhang, Qi Wu
Composed Image Retrieval (CIR) aims to retrieve target images that closely
resemble a reference image while integrating user-specified textual
modifications, thereby capturing user intent more precisely. Existing
training-free zero-shot CIR (ZS-CIR) methods often employ a two-stage process:
they first generate a caption for the reference image and then use Large
Language Models for reasoning to obtain a target description. However, these
methods suffer from missing critical visual details and limited reasoning
capabilities, leading to suboptimal retrieval performance. To address these
challenges, we propose a novel, training-free one-stage method, One-Stage
Reflective Chain-of-Thought Reasoning for ZS-CIR (OSrCIR), which employs
Multimodal Large Language Models to retain essential visual information in a
single-stage reasoning process, eliminating the information loss seen in
two-stage methods. Our Reflective Chain-of-Thought framework further improves
interpretative accuracy by aligning manipulation intent with contextual cues
from reference images. OSrCIR achieves performance gains of 1.80% to 6.44% over
existing training-free methods across multiple tasks, setting new
state-of-the-art results in ZS-CIR and enhancing its utility in vision-language
applications. Our code will be available at
https://github.com/Pter61/osrcir2024/.
♻ ☆ Attentive Eraser: Unleashing Diffusion Model's Object Removal Potential via Self-Attention Redirection Guidance
Recently, diffusion models have emerged as promising newcomers in the field
of generative models, shining brightly in image generation. However, when
employed for object removal tasks, they still encounter issues such as
generating random artifacts and the incapacity to repaint foreground object
areas with appropriate content after removal. To tackle these problems, we
propose Attentive Eraser, a tuning-free method to empower pre-trained diffusion
models for stable and effective object removal. Firstly, in light of the
observation that the self-attention maps influence the structure and shape
details of the generated images, we propose Attention Activation and
Suppression (ASS), which re-engineers the self-attention mechanism within the
pre-trained diffusion models based on the given mask, thereby prioritizing the
background over the foreground object during the reverse generation process.
Moreover, we introduce Self-Attention Redirection Guidance (SARG), which
utilizes the self-attention redirected by ASS to guide the generation process,
effectively removing foreground objects within the mask while simultaneously
generating content that is both plausible and coherent. Experiments demonstrate
the stability and effectiveness of Attentive Eraser in object removal across a
variety of pre-trained diffusion models, outperforming even training-based
methods. Furthermore, Attentive Eraser can be implemented in various diffusion
model architectures and checkpoints, enabling excellent scalability. Code is
available at https://github.com/Anonym0u3/AttentiveEraser.
comment: Accepted by AAAI 2025
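A rough sketch of mask-guided self-attention re-weighting in the spirit of the attention suppression described above (not the paper's exact ASS/SARG formulation): attention logits toward keys inside the removal mask are pushed down, so generation inside the mask is driven by background content.

```python
import torch

def masked_self_attention(q, k, v, remove_mask, suppress=-1e4):
    """q, k, v: (N, D) token features; remove_mask: (N,) bool, True for foreground tokens."""
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.t()) * scale                # (N, N) attention logits
    logits[:, remove_mask] += suppress          # discourage attending to masked (foreground) keys
    attn = logits.softmax(dim=-1)
    return attn @ v
```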
♻ ☆ ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
The efficiency of large vision-language models (LVLMs) is constrained by the
computational bottleneck of the attention mechanism during the prefill phase
and the memory bottleneck of fetching the key-value (KV) cache in the decoding
phase, particularly in scenarios involving high-resolution images or videos.
Visual content often exhibits substantial redundancy, resulting in highly
sparse attention maps within LVLMs. This sparsity can be leveraged to
accelerate attention computation or compress the KV cache through various
approaches. However, most studies focus on addressing only one of these
bottlenecks and do not adequately support dynamic adjustment of sparsity
concerning distinct layers or tasks. In this paper, we present ZipVL, an
efficient inference framework designed for LVLMs through a dynamic ratio
allocation strategy of important tokens. This ratio is adaptively determined
based on the layer-specific distribution of attention scores, rather than fixed
hyper-parameters, thereby improving efficiency for less complex tasks while
maintaining high performance for more challenging ones. Then we select
important tokens based on their normalized attention scores and perform sparse
attention mechanism solely on those important tokens, reducing the latency in
the prefill phase. Tokens deemed less important will be discarded to reduce KV
cache size, alleviating the memory bottleneck in the decoding phase. Our
experiments demonstrate that ZipVL can accelerate the prefill phase by
2.3$\times$ and improve decoding throughput by 2.8$\times$, with a minimal
accuracy reduction of only 0.5\% on VQAv2 benchmark over LLaVA-Next-13B model,
effectively enhancing the generation efficiency of LVLMs.
comment: 13 pages
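A hedged sketch of dynamic important-token selection: instead of a fixed keep ratio, the smallest set of tokens whose summed attention mass reaches a per-layer target is retained and the rest are pruned. This mirrors the idea above but is not the authors' exact allocation rule.

```python
import torch

def select_important_tokens(attn_scores: torch.Tensor, mass: float = 0.95) -> torch.Tensor:
    """attn_scores: (N,) aggregated attention received by each visual token."""
    probs = attn_scores / attn_scores.sum()
    order = probs.argsort(descending=True)
    cum = probs[order].cumsum(dim=0)
    k = int((cum < mass).sum().item()) + 1      # smallest prefix reaching the target mass
    return order[:k]                            # indices of tokens to keep; others are pruned
```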
♻ ☆ Leveraging Semantic Asymmetry for Precise Gross Tumor Volume Segmentation of Nasopharyngeal Carcinoma in Planning CT
Zi Li, Ying Chen, Zeli Chen, Yanzhou Su, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Yunhai Bai, Zhinlin Zheng, Le Lu, Yirui Wang, Jia Ge, Xianghua Ye, Senxiang Yan, Dakai Jin
In the radiation therapy of nasopharyngeal carcinoma (NPC), clinicians
typically delineate the gross tumor volume (GTV) using non-contrast planning
computed tomography to ensure accurate radiation dose delivery. However, the
low contrast between tumors and adjacent normal tissues necessitates that
radiation oncologists manually delineate the tumors, often relying on
diagnostic MRI for guidance. In this study, we propose a novel approach to
directly segment NPC gross tumors on non-contrast planning CT images,
circumventing potential registration errors when aligning MRI or MRI-derived
tumor masks to planning CT. To address the low contrast issues between tumors
and adjacent normal structures in planning CT, we introduce a 3D Semantic
Asymmetry Tumor segmentation (SATs) method. Specifically, we posit that a
healthy nasopharyngeal region is characteristically bilaterally symmetric,
whereas the emergence of nasopharyngeal carcinoma disrupts this symmetry. Then,
we propose a Siamese contrastive learning segmentation framework that minimizes
the voxel-wise distance between original and flipped areas without tumor and
encourages a larger distance between original and flipped areas with tumor.
Thus, our approach enhances the sensitivity of features to semantic
asymmetries. Extensive experiments demonstrate that the proposed SATs
achieves the leading NPC GTV segmentation performance in both internal and
external testing, e.g., with at least 2\% absolute Dice score
improvement and 12\% average distance error reduction when compared to other
state-of-the-art methods in the external testing.
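A minimal sketch of a symmetry-based Siamese contrastive objective of the kind described above: features of the volume and of its left-right flipped copy are pulled together where no tumor is present and pushed apart (up to a margin) inside the tumor mask. The encoder, mask conventions, and margin are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def symmetry_contrastive_loss(encoder, volume, tumor_mask, margin: float = 1.0):
    """volume: (B, 1, D, H, W); tumor_mask: (B, 1, D, H, W) binary."""
    feat = encoder(volume)
    feat_flip = encoder(torch.flip(volume, dims=[-1]))           # flip along the left-right axis
    feat_flip = torch.flip(feat_flip, dims=[-1])                 # map back to the original frame
    dist = (feat - feat_flip).pow(2).mean(dim=1, keepdim=True)   # voxel-wise squared distance
    mask = F.interpolate(tumor_mask.float(), size=dist.shape[2:], mode="nearest")
    pull = (dist * (1 - mask)).sum() / (1 - mask).sum().clamp_min(1)    # symmetric regions: minimize
    push = (F.relu(margin - dist) * mask).sum() / mask.sum().clamp_min(1)  # tumor regions: separate
    return pull + push
```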
♻ ☆ ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality
In this paper, we propose ZipAR, a training-free, plug-and-play parallel
decoding framework for accelerating auto-regressive (AR) visual generation. The
motivation stems from the observation that images exhibit local structures, and
spatially distant regions tend to have minimal interdependence. Given a
partially decoded set of visual tokens, in addition to the original next-token
prediction scheme in the row dimension, the tokens corresponding to spatially
adjacent regions in the column dimension can be decoded in parallel, enabling
the ``next-set prediction'' paradigm. By decoding multiple tokens
simultaneously in a single forward pass, the number of forward passes required
to generate an image is significantly reduced, resulting in a substantial
improvement in generation efficiency. Experiments demonstrate that ZipAR can
reduce the number of model forward passes by up to 91% on the Emu3-Gen model
without requiring any additional retraining. Code is available here:
https://github.com/ThisisBillhe/ZipAR.
comment: 11 pages
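An illustrative "next-set" schedule for locality-based parallel decoding (the real ZipAR scheduling may differ): each row starts `lag` columns after the row above it, and all positions that become ready at a step are decoded together in one forward pass.

```python
def parallel_schedule(height: int, width: int, lag: int = 4):
    steps = {}
    for r in range(height):
        for c in range(width):
            t = c + r * lag                        # row r trails row r-1 by `lag` columns
            steps.setdefault(t, []).append((r, c))
    return [steps[t] for t in sorted(steps)]       # token sets decoded per forward pass

sched = parallel_schedule(8, 8, lag=4)
print(len(sched), "forward passes instead of", 8 * 8)   # 36 passes for an 8x8 token grid
```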
♻ ☆ Optimized Gradient Clipping for Noisy Label Learning
Previous research has shown that constraining the gradient of loss function
with respect to model-predicted probabilities can enhance the model robustness
against noisy labels. These methods typically specify a fixed optimal threshold
for gradient clipping through validation data to obtain the desired robustness
against noise. However, this common practice overlooks the dynamic distribution
of gradients from both clean and noisy-labeled samples at different stages of
training, significantly limiting the model capability to adapt to the variable
nature of gradients throughout the training process. To address this issue, we
propose a simple yet effective approach called Optimized Gradient Clipping
(OGC), which dynamically adjusts the clipping threshold based on the ratio of
noise gradients to clean gradients after clipping, estimated by modeling the
distributions of clean and noisy samples. This approach allows us to modify the
clipping threshold at each training step, effectively controlling the influence
of noise gradients. Additionally, we provide statistical analysis to certify
the noise-tolerance ability of OGC. Our extensive experiments across various
types of label noise, including symmetric, asymmetric, instance-dependent, and
real-world noise, demonstrate the effectiveness of our approach.
comment: Accepted by AAAI2025
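A hedged sketch of gradient clipping with respect to predicted probabilities: for cross-entropy, the gradient magnitude with respect to the true-class probability is 1/p, so capping it at a threshold `tau` amounts to switching to a linear loss once p drops below 1/tau. How OGC chooses `tau` at each step (from modeled clean/noisy gradient distributions) is not reproduced here; it is simply exposed as a parameter.

```python
import math
import torch

def clipped_ce(probs: torch.Tensor, targets: torch.Tensor, tau: float) -> torch.Tensor:
    """probs: (B, C) softmax outputs; targets: (B,) int labels; tau: max |dL/dp|."""
    p_true = probs.gather(1, targets[:, None]).squeeze(1)
    p_min = 1.0 / tau
    ce = -torch.log(p_true.clamp_min(p_min))               # usual CE where the gradient stays below tau
    linear = -math.log(p_min) + tau * (p_min - p_true)      # linear extension caps |dL/dp| at tau
    return torch.where(p_true >= p_min, ce, linear).mean()
```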
♻ ☆ Are Vision xLSTM Embedded UNet More Reliable in Medical 3D Image Segmentation?
The development of efficient segmentation strategies for medical images has
evolved from its initial dependence on Convolutional Neural Networks (CNNs) to
the current investigation of hybrid models that combine CNNs with Vision
Transformers. There is an increasing focus on creating architectures that are
both high-performance and computationally efficient, able to be deployed on
remote systems with limited resources. Although transformers can capture global
dependencies in the input space, they face challenges from the corresponding
high computational and storage expenses involved. This paper investigates the
integration of CNNs with Vision Extended Long Short-Term Memory (Vision-xLSTM)s
by introducing the novel U-VixLSTM.
The Vision-xLSTM blocks capture temporal and global relationships within the
patches, as extracted from the CNN feature maps. The convolutional feature
reconstruction path upsamples the output volume from the Vision-xLSTM blocks,
to produce the segmentation output. Our primary objective is to propose that
Vision-xLSTM forms an appropriate backbone for medical image segmentation,
offering excellent performance with reduced computational costs. The U-VixLSTM
exhibits superior performance, compared to the state-of-the-art networks in the
publicly available Synapse, ISIC and ACDC datasets. Code provided:
https://github.com/duttapallabi2907/U-VixLSTM
♻ ☆ ManipGPT: Is Affordance Segmentation by Large Vision Models Enough for Articulated Object Manipulation?
Visual actionable affordance has emerged as a transformative approach in
robotics, focusing on perceiving interaction areas prior to manipulation.
Traditional methods rely on pixel sampling to identify successful interaction
samples or processing pointclouds for affordance mapping. However, these
approaches are computationally intensive and struggle to adapt to diverse and
dynamic environments. This paper introduces ManipGPT, a framework designed to
predict optimal interaction areas for articulated objects using a large
pre-trained vision transformer (ViT). We created a dataset of 9.9k simulated
and real images to bridge the sim-to-real gap and enhance real-world
applicability. By fine-tuning the vision transformer on this small dataset, we
significantly improved part-level affordance segmentation, adapting the model's
in-context segmentation capabilities to robot manipulation scenarios. This
enables effective manipulation across simulated and real-world environments by
generating part-level affordance masks, paired with an impedance adaptation
policy, effectively eliminating the need for complex datasets or perception
systems.
comment: 8 pages, 6 figures
♻ ☆ Open-Vocabulary High-Resolution 3D (OVHR3D) Data Segmentation and Annotation Framework
In the domain of the U.S. Army modeling and simulation, the availability of
high-quality annotated 3D data is pivotal to creating virtual environments for
training and simulations. Traditional methodologies for 3D semantic and
instance segmentation, such as KpConv, RandLA, Mask3D, etc., are designed to
train on extensive labeled datasets to obtain satisfactory performance in
practical tasks. This requirement presents a significant challenge, given the
inherent scarcity of manually annotated 3D datasets, particularly for the
military use cases. Recognizing this gap, our previous research leveraged
manually annotated databases from the One World Terrain data repository, as
showcased at IITSEC 2019 and 2021, to enrich the training dataset for deep
learning models.
However, collecting and annotating large scale 3D data for specific tasks
remains costly and inefficient. To this end, the objective of this research is
to design and develop a comprehensive and efficient framework for 3D
segmentation tasks to assist in 3D data annotation. This framework integrates
Grounding DINO and the Segment Anything Model, augmented by improved 2D image
rendering from the 3D mesh. Furthermore, the authors have also developed a
user-friendly interface that facilitates the 3D annotation process, offering
intuitive visualization of rendered images and the 3D point cloud.
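The framework relies on off-the-shelf Grounding DINO and Segment Anything for the 2D predictions; the rendered-view approach then needs a way to carry those 2D masks back onto the 3D data. A minimal sketch of that label-transfer step, assuming a simple pinhole camera model for the rendered views (the function name and every parameter here are illustrative), is:

```python
import numpy as np

def lift_mask_to_points(points_world, mask, K, R, t):
    """Assign a 2D mask label to 3D points by projecting them through a
    pinhole camera (intrinsics K, rotation R, translation t). Only the
    label-transfer step is shown; the mask itself would come from Grounding
    DINO boxes refined by Segment Anything on a rendered view."""
    cam = (R @ points_world.T + t[:, None]).T           # world -> camera frame
    in_front = cam[:, 2] > 1e-6
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                         # perspective divide
    h, w = mask.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.zeros(len(points_world), dtype=bool)
    labels[valid] = mask[v[valid], u[valid]]
    return labels  # True where the projected point falls inside the 2D mask

# Example: points = np.random.rand(1000, 3); mask = np.ones((480, 640), bool)
# K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
# labels = lift_mask_to_points(points, mask, K, np.eye(3), np.array([0., 0, 2]))
```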
♻ ☆ Rethinking Multi-domain Generalization with A General Learning Objective
Multi-domain generalization (mDG) universally aims to minimize the
discrepancy between training and testing distributions to enhance
marginal-to-label distribution mapping. However, existing mDG literature lacks
a general learning objective paradigm and often imposes constraints on static
target marginal distributions. In this paper, we propose to leverage a
$Y$-mapping to relax the constraint. We rethink the learning objective for mDG
and design a new \textbf{general learning objective} to interpret and analyze
most existing mDG wisdom. This general objective is bifurcated into two
synergistic aims: learning domain-independent conditional features and
maximizing a posterior. Explorations also extend to two effective
regularization terms that incorporate prior information and suppress invalid
causality, alleviating the issues that come with relaxed constraints. We
theoretically contribute an upper bound for the domain alignment of
domain-independent conditional features, disclosing that many previous mDG
endeavors actually \textbf{only partially optimize the objective} and thus lead to
limited performance. As such, our study distills a general learning objective
into four practical components, providing a general, robust, and flexible
mechanism to handle complex domain shifts. Extensive empirical results indicate
that the proposed objective with $Y$-mapping leads to substantially better mDG
performance in various downstream tasks, including regression, segmentation,
and classification.
comment: Accepted by CVPR24
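For intuition only, the two aims can be composed into a single training loss as follows: a cross-entropy term (maximizing the posterior) plus a penalty that pulls per-class feature means from different domains together (domain-independent conditional features). This is a generic stand-in; the paper's $Y$-mapping and its two regularization terms are not reproduced, and the particular alignment penalty below is an assumption chosen for brevity.

```python
import torch
import torch.nn.functional as F

def mdg_objective(features, logits, labels, domains, align_weight=0.1):
    """Illustrative composition of the two aims: posterior maximization via
    cross-entropy, plus per-class alignment of feature means across domains."""
    ce = F.cross_entropy(logits, labels)
    align = features.new_zeros(())
    for c in labels.unique():
        means = []
        for d in domains.unique():
            sel = (labels == c) & (domains == d)
            if sel.any():
                means.append(features[sel].mean(dim=0))
        if len(means) > 1:
            stacked = torch.stack(means)            # (domains containing c, dim)
            align = align + ((stacked - stacked.mean(0)) ** 2).sum()
    return ce + align_weight * align
```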
♻ ☆ From Optimization to Generalization: Fair Federated Learning against Quality Shift via Inter-Client Sharpness Matching
Due to escalating privacy concerns, federated learning has been recognized as
a vital approach for training deep neural networks with decentralized medical
data. In practice, it is challenging to ensure consistent imaging quality
across institutions, with degradation often attributed to equipment malfunctions
affecting a minority of clients. This imbalance in image quality can cause the
federated model to develop an inherent bias towards higher-quality images, thus
posing a severe fairness issue. In this study, we pioneer the identification
and formulation of this new fairness challenge within the context of the
imaging quality shift. Traditional methods for promoting fairness in federated
learning predominantly focus on balancing empirical risks across diverse client
distributions. This strategy primarily facilitates fair optimization across
different training data distributions, yet neglects the crucial aspect of
generalization. To address this, we introduce a solution termed Federated
learning with Inter-client Sharpness Matching (FedISM). FedISM enhances both
local training and global aggregation by incorporating sharpness-awareness,
aiming to harmonize the sharpness levels across clients for fair
generalization. Our empirical evaluations, conducted using the widely-used ICH
and ISIC 2019 datasets, establish FedISM's superiority over current
state-of-the-art federated learning methods in promoting fairness. Code is
available at https://github.com/wnn2000/FFL4MIA.
comment: This paper is accepted at IJCAI'24 (Main Track)
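Sharpness here is the SAM-style quantity: how much the loss rises after a small ascent step in weight space. The helper below estimates that per-client proxy; FedISM's actual inter-client matching and aggregation rules are not reproduced, so treat the function and its rho parameter as illustrative assumptions.

```python
import torch

def estimate_sharpness(model, loss_fn, batch, rho=0.05):
    """SAM-style sharpness proxy: loss increase after an ascent step of
    radius rho in the gradient direction."""
    x, y = batch
    model.zero_grad()
    base_loss = loss_fn(model(x), y)
    base_loss.backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)       # ascend to the nearby "sharp" point
        perturbed_loss = loss_fn(model(x), y)
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / norm)       # restore the original weights
    return (perturbed_loss - base_loss).item()
```

A server could, for instance, compare these per-client values and bias training toward the clients whose loss landscape is sharpest, which is the flavor of matching the abstract describes; the paper's precise rule may differ.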
♻ ☆ Long-Tailed Out-of-Distribution Detection: Prioritizing Attention to Tail
Current out-of-distribution (OOD) detection methods typically assume balanced
in-distribution (ID) data, while most real-world data follow a long-tailed
distribution. Previous approaches to long-tailed OOD detection often involve
balancing the ID data by reducing the semantics of head classes. However, this
reduction can severely affect the classification accuracy of ID data. The main
challenge of this task lies in the severe lack of features for tail classes,
leading to confusion with OOD data. To tackle this issue, we introduce a novel
Prioritizing Attention to Tail (PATT) method using augmentation instead of
reduction. Our main intuition involves using a mixture of von Mises-Fisher
(vMF) distributions to model the ID data and a temperature scaling module to
boost the confidence of ID data. This enables us to generate infinite
contrastive pairs, implicitly enhancing the semantics of ID classes while
promoting differentiation between ID and OOD data. To further strengthen the
detection of OOD data without compromising the classification performance of ID
data, we propose feature calibration during the inference phase. By extracting
an attention weight from the training set that prioritizes the tail classes and
reduces the confidence in OOD data, we improve the OOD detection capability.
Extensive experiments verify that our method outperforms the current
state-of-the-art methods on various benchmarks.
comment: Accepted by AAAI'25. Extended version with full appendix, 13 pages
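To make the vMF-with-temperature idea concrete, the toy scorer below keeps class prototypes on the unit sphere, nudges a test feature along a tail-weighted prototype direction, and reads the maximum temperature-scaled cosine similarity as an ID-vs-OOD score. The calibration rule and every constant here are illustrative assumptions, not PATT's actual formulation.

```python
import torch
import torch.nn.functional as F

def ood_score(feat, prototypes, class_counts, temperature=0.1, calib=0.5):
    """Toy scorer: vMF-style unit-norm features/prototypes, temperature-scaled
    similarities, and a tail-biased calibration of the test feature."""
    feat = F.normalize(feat, dim=-1)
    protos = F.normalize(prototypes, dim=-1)             # (num_classes, dim)
    tail_weight = 1.0 / class_counts.float()             # rarer class -> larger weight
    tail_weight = tail_weight / tail_weight.sum()
    attention = (tail_weight[:, None] * protos).sum(0)   # tail-biased direction
    feat = F.normalize(feat + calib * attention, dim=-1)
    sims = feat @ protos.T / temperature                 # (batch, num_classes)
    return sims.max(dim=-1).values                       # higher -> more ID-like

# Example: ood_score(torch.randn(8, 128), torch.randn(10, 128),
#                    torch.tensor([500, 300, 200, 100, 80, 60, 40, 20, 10, 5]))
```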